# 1. Intro to the Dataset and the Aim
\<img src="/jamboree_logo.png" alt="jamboree logo banner" style="width: 800px;"/>

Jamboree has helped thousands of students like you make it to top colleges abroad. Be it GMAT, GRE or SAT, their unique problem-solving methods ensure maximum scores with minimum effort.

The Jamboree team wants to know which factors matter for a student's success in getting into an Ivy League college. They also want to see whether a predictive model can estimate the chance of admission to an Ivy League college from the given features.

**Dataset**

This dataset contains the details of 500 students who have applied for admission to Ivy League colleges, along with their chance of admission.

Summary of sanitized data:

| Column              | Description         | Data Type  |
|---------------------|---------------------|------------|
| `serial_no`         | Unique row ID       | `int64`    |
| `gre_score`         | out of 340          | `int64`    |
| `toefl_score`       | out of 120          | `int64`    |
| `university_rating` | out of 5            | `category` |
| `sop`               | out of 5            | `category` |
| `lor`               | out of 5            | `category` |
| `cgpa`              | out of 10           | `float64`  |
| `research`          | either 0 or 1       | `category` |
| `chance_of_admit`   | ranging from 0 to 1 | `float64`  |

Additional feature engineered columns:

| Column | Description | Expected Data Type |
|--------|-------------|--------------------|
| `GRE`  | out of 340  | `int64`            |

**Aim:** 
1. To analyze which factors are important for a student's success in getting into an Ivy League college.
2. To build a predictive model that estimates the chance of admission (`chance_of_admit`) to an Ivy League college from the given features.

**Methods and Techniques used:** EDA, feature engineering, modeling using sklearn pipelines, hyperparameter tuning

**Measure of Performance and Minimum Threshold to reach the business objective:** RMSE of 5% or less
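
As a rough sketch of how this threshold could be checked: the snippet below assumes that "5%" means an RMSE of 0.05 on the 0 to 1 `chance_of_admit` scale, which is an interpretation rather than something stated above. The `rmse` helper, the `RMSE_THRESHOLD` name, and the sample values are hypothetical and for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions, for illustration only
y_true = [0.92, 0.76, 0.72, 0.80]
y_pred = [0.90, 0.80, 0.70, 0.78]

RMSE_THRESHOLD = 0.05  # assumed reading of "RMSE of 5% or less"
score = rmse(y_true, y_pred)
meets_objective = score <= RMSE_THRESHOLD
```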

**Assumptions**
1. This fairly small dataset (500 entries) is representative of the real-world population.
2. The data is stable and does not change over time, so the model is assumed not to decay.

## 1.1 Library Setup

In [56]:
# Scientific libraries
import numpy as np
import pandas as pd

# Logging
import logging

# Visual libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Helper libraries
import urllib.request
from tqdm.notebook import tqdm, trange # Progress bar
#from colorama import Fore, Back, Style # coloured text in output
import warnings
#warnings.filterwarnings('ignore') # ignore all warnings

# Visual setup
%config InlineBackend.figure_format = 'retina' # sets the figure format to 'retina' for high-resolution displays.

# IPython display options
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # display the result of every expression in a cell, not just the last one

# Table styles
table_styles = {
    'cerulean_palette': [
        dict(selector="th", props=[("color", "#FFFFFF"), ("background", "#004D80"), ("text-transform", "capitalize")]),
        dict(selector="td", props=[("color", "#333333")]),
        dict(selector="table", props=[("font-family", 'Arial'), ("border-collapse", "collapse")]),
        dict(selector='tr:nth-child(even)', props=[('background', '#D3EEFF')]),
        dict(selector='tr:nth-child(odd)', props=[('background', '#FFFFFF')]),
        dict(selector="th", props=[("border", "1px solid #0070BA")]),
        dict(selector="td", props=[("border", "1px solid #0070BA")]),
        dict(selector="tr:hover", props=[("background", "#80D0FF")]),
        dict(selector="tr", props=[("transition", "background 0.5s ease")]),
        dict(selector="th:hover", props=[("font-size", "1.07rem")]),
        dict(selector="th", props=[("transition", "font-size 0.5s ease-in-out")]),
        dict(selector="td:hover", props=[('font-size', '1.07rem'),('font-weight', 'bold')]),
        dict(selector="td", props=[("transition", "font-size 0.5s ease-in-out")])
    ]
}

#from rich import print # color from print statement 
# Seed value for numpy.random => makes notebooks stable across runs
np.random.seed(42)

# Data Collection

## Data Ingestion

In [57]:
class DataIngestor:
    """
    A class to handle downloading data and loading it into a pandas dataframe along with basic sanity options
    """
    def __init__(self, file_path : str = '../data/raw', url : str = None, output_path : str = 'data/processed'):
        # Exactly one source may be given: a url (downloaded into the default directory) or a custom file_path
        if (url is None and file_path == '../data/raw') or (url is not None and file_path != '../data/raw'):
            raise ValueError('Specify exactly one of url or file_path')
        # For a url, the local path is the default directory plus the file name taken from the url
        self.file_path = f"{file_path}/{url.split('/')[-1]}" if url is not None else file_path
        self.url = url
        self.output_path = output_path
    
    def download_data(self) -> None:
        logging.info(f'Downloading data from {self.url}')
        urllib.request.urlretrieve(self.url, self.file_path)

    def load_data(self) -> pd.DataFrame:
        logging.info(f'Ingesting data from {self.file_path}')
        #TODO add csv check
        return pd.read_csv(self.file_path)
    
    def save_data(self, df : pd.DataFrame) -> None:
        logging.info(f'Saving data to {self.output_path}')
        df.to_csv(self.output_path)
        
    def sanitize(self, df : pd.DataFrame) -> pd.DataFrame:
        """
        Rename columns to snake case and strip whitespace
        """
        return df.rename(lambda x: x.lower().strip().replace(' ', '_'),axis='columns')

In [58]:
data_url = DataIngestor(url = 'https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv')

data_url.download_data()
df = data_url.sanitize(data_url.load_data())

In [59]:
df
df.info()
df.describe()

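|     | serial_no. | gre_score | toefl_score | university_rating | sop | lor | cgpa | research | chance_of_admit |
|-----|------------|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| 0   | 1          | 337       | 118         | 4                 | 4.5 | 4.5 | 9.65 | 1        | 0.92            |
| 1   | 2          | 324       | 107         | 4                 | 4.0 | 4.5 | 8.87 | 1        | 0.76            |
| 2   | 3          | 316       | 104         | 3                 | 3.0 | 3.5 | 8.00 | 1        | 0.72            |
| 3   | 4          | 322       | 110         | 3                 | 3.5 | 2.5 | 8.67 | 1        | 0.80            |
| 4   | 5          | 314       | 103         | 2                 | 2.0 | 3.0 | 8.21 | 0        | 0.65            |
| ... | ...        | ...       | ...         | ...               | ... | ... | ...  | ...      | ...             |
| 495 | 496        | 332       | 108         | 5                 | 4.5 | 4.0 | 9.02 | 1        | 0.87            |
| 496 | 497        | 337       | 117         | 5                 | 5.0 | 5.0 | 9.87 | 1        | 0.96            |
| 497 | 498        | 330       | 120         | 5                 | 4.5 | 5.0 | 9.56 | 1        | 0.93            |
| 498 | 499        | 312       | 103         | 4                 | 4.0 | 5.0 | 8.43 | 0        | 0.73            |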


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   serial_no.         500 non-null    int64  
 1   gre_score          500 non-null    int64  
 2   toefl_score        500 non-null    int64  
 3   university_rating  500 non-null    int64  
 4   sop                500 non-null    float64
 5   lor                500 non-null    float64
 6   cgpa               500 non-null    float64
 7   research           500 non-null    int64  
 8   chance_of_admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


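|       | serial_no. | gre_score  | toefl_score | university_rating | sop      | lor     | cgpa     | research | chance_of_admit |
|-------|------------|------------|-------------|-------------------|----------|---------|----------|----------|-----------------|
| count | 500.0      | 500.0      | 500.0       | 500.0             | 500.0    | 500.0   | 500.0    | 500.0    | 500.0           |
| mean  | 250.5      | 316.472    | 107.192     | 3.114             | 3.374    | 3.484   | 8.57644  | 0.56     | 0.72174         |
| std   | 144.481833 | 11.295148  | 6.081868    | 1.143512          | 0.991004 | 0.92545 | 0.604813 | 0.496884 | 0.14114         |
| min   | 1.0        | 290.0      | 92.0        | 1.0               | 1.0      | 1.0     | 6.8      | 0.0      | 0.34            |
| 25%   | 125.75     | 308.0      | 103.0       | 2.0               | 2.5      | 3.0     | 8.1275   | 0.0      | 0.63            |
| 50%   | 250.5      | 317.0      | 107.0       | 3.0               | 3.5      | 3.5     | 8.56     | 1.0      | 0.72            |
| 75%   | 375.25     | 325.0      | 112.0       | 4.0               | 4.0      | 4.0     | 9.04     | 1.0      | 0.82            |
| max   | 500.0      | 340.0      | 120.0       | 5.0               | 5.0      | 5.0     | 9.92     | 1.0      | 0.97            |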


# Data Wrangling

**Data Integrity and Consistency**
- [x] Uniformity Constraint: All data should be in the same unit or format like date, currency, scales.
    - Example: Convert all date columns to a consistent format (e.g., 'YYYY-MM-DD').
    - Remove symbols like % or $.
    - Remove implicit scaling, e.g. 10 meaning 10k rupees.
- [x] Data Type Constraint: Constrain each variable to a specific data type, and check for mixed data types within a column.
    - Example: Convert a column with mixed integers and strings to a consistent data type.
- [x] Data Range Constraints: Each variable could have a limited range like date < current date.
    - Example: Check if all birth dates are before the current date.
- Uniqueness Constraint: There should be no duplicates, whether complete rows or multiple values for the same primary key. For the latter, group by the key and average.
    - Example: Check for duplicate customer IDs; if one customer has several different income or credit score values, average them.
- Text Constraint: Text should be in the expected format for that variable, like phone numbers and emails following patterns and fixed lengths.
    - Example: Check if all email addresses follow the pattern `name@domain.com`.
    - Remove special characters, trim whitespace, and fix inconsistent capitalization.
    - Encoding Issues: Check for consistent encoding (e.g., UTF-8, ASCII) across text data.
        - Example: Convert all text columns to UTF-8 encoding.
- Categorical Constraint and Order:
    - Collapse categories with the same or similar names.
    - Rename categories and specify their order.
    - Check for categories with very low counts (which may need to be grouped or removed).
    - Example: Group infrequent 'product_category' values into an 'Other' category; collapse 'single', 'unmarried', and 's' into one category.
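
A few of the integrity checks above can be sketched in pandas as follows. The `raw` frame, its column names, and the `status_map` dictionary are hypothetical and exist only to illustrate the uniformity, text, categorical, and uniqueness constraints; they are not part of the admissions dataset.

```python
import pandas as pd

# Hypothetical customer records, for illustration only
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "income": ["50k", "70k", "40k", "90k", "30k"],
    "marital_status": ["single", "unmarried", " S ", "Married", "married"],
})

# Uniformity + data type: strip the 'k' scaling and store income as a number
raw["income"] = raw["income"].str.rstrip("k").astype(float) * 1_000

# Text + categorical: trim, lower-case, then collapse synonymous labels
status_map = {"single": "single", "unmarried": "single", "s": "single", "married": "married"}
raw["marital_status"] = raw["marital_status"].str.strip().str.lower().map(status_map)

# Uniqueness: average numeric values that share the same primary key
clean = raw.groupby("customer_id", as_index=False).agg(
    income=("income", "mean"),
    marital_status=("marital_status", "first"),
)

# Store the collapsed labels as a category dtype once they are consistent
clean["marital_status"] = clean["marital_status"].astype("category")
```

Averaging is only one way to resolve conflicting values for a key; keeping the most recent record is another option when a timestamp is available.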

**Data Validation and Relationships**

- Cross-Field Validation: Compare multiple variables to get cross-field validation like delivery date >= purchase date.
    - Example: Check if the delivery date is always greater than or equal to the purchase date.
- Data Leakage: Ensure that future information is not inadvertently included in the training data (especially relevant for time series or predictive modeling).
    - Example: In a predictive model for stock prices, ensure that future stock prices are not included in the training data.
- Referential Integrity: Validate foreign key relationships between tables or datasets.
    - Example: Check if all order IDs in the orders table have a corresponding customer ID in the customers table.
- Business Rules: Check for any specific business rules or domain-specific constraints that the data should adhere to.
    - Example: In a retail dataset, check if the total order value is equal to the sum of item prices.
- Hierarchical Validation: Validate the hierarchical relationships within the data, such as ensuring that a subcategory belongs to the correct main category.
    - Example: Check if all 'product_subcategory' values correctly correspond to their 'product_category' values.
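
The cross-field and referential checks above might look like this in pandas. The `orders` and `customers` frames are made up for illustration and deliberately contain one violation of each rule.

```python
import pandas as pd

# Hypothetical tables, for illustration only
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 9],
    "purchase_date": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-12"]),
    "delivery_date": pd.to_datetime(["2024-01-08", "2024-01-09", "2024-01-15"]),
})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

# Cross-field validation: delivery must not precede purchase
bad_dates = orders.loc[
    orders["delivery_date"] < orders["purchase_date"], "order_id"
].tolist()

# Referential integrity: every order must point at a known customer
orphans = orders.loc[
    ~orders["customer_id"].isin(customers["customer_id"]), "order_id"
].tolist()
```

Returning the offending IDs, rather than a single boolean, makes it easier to inspect or quarantine the bad rows.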

In [60]:
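# Negative testing: uncomment the lines below to inject out-of-range values and confirm that range_constrain_check flags them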
# df.iloc[2,1] = 1000
# df.iloc[3,4] = 1000
# df.iloc[5,3] = 1000

class DataWrangler:
    """
    A class to handle cleaning and wrangling data
    """
    def __init__(self, df : pd.DataFrame):
        self.df = df
        self.processed_df = pd.DataFrame()
        self.failed_index_range_constrains = pd.Index([])
        self.data_type_map = {}
        
    def clean(self) -> pd.DataFrame:
        return self.df
    
    def set_data_type(self, data_type_map : dict) -> pd.DataFrame:
        self.data_type_map = data_type_map
        self.processed_df = self.df.astype(data_type_map)
        return self.processed_df
    
    def range_constrain_check(self, constrains : dict) -> pd.Index:
        self.range_constrains = constrains
        mask = np.zeros(len(self.df), dtype=bool)
        for col,(min_range, max_range) in self.range_constrains.items(): 
            # flag rows whose value falls outside the allowed [min_range, max_range] interval
            mask |= ((self.df[col] < min_range) | (self.df[col] > max_range)).to_numpy()
        self.failed_index_range_constrains = self.df[mask].index
        return self.failed_index_range_constrains
    
    
range_constrains = {
    'gre_score': [0, 340],
    'toefl_score': [0, 120],
    'university_rating': [0, 5],
    'sop': [0, 5],
    'lor': [0, 5],
    'cgpa': [0, 10],
    'research': [0, 1]
}

data_type_map = {
    'serial_no.': 'int16',
    'gre_score': 'int16',
    'toefl_score': 'int16',
    'university_rating': 'category',
    'sop': 'category',
    'lor': 'category',
    'cgpa': 'float16',
    'research': 'int16'
}
clean_data = DataWrangler(df)
idx = clean_data.range_constrain_check(range_constrains)


df = clean_data.set_data_type(data_type_map)

In [63]:
df.info()
df

clean_data.df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   serial_no.         500 non-null    int16   
 1   gre_score          500 non-null    int16   
 2   toefl_score        500 non-null    int16   
 3   university_rating  500 non-null    category
 4   sop                500 non-null    category
 5   lor                500 non-null    category
 6   cgpa               500 non-null    float16 
 7   research           500 non-null    int16   
 8   chance_of_admit    500 non-null    float64 
dtypes: category(3), float16(1), float64(1), int16(4)
memory usage: 11.3 KB


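|     | serial_no. | gre_score | toefl_score | university_rating | sop | lor | cgpa     | research | chance_of_admit |
|-----|------------|-----------|-------------|-------------------|-----|-----|----------|----------|-----------------|
| 0   | 1          | 337       | 118         | 4                 | 4.5 | 4.5 | 9.648438 | 1        | 0.92            |
| 1   | 2          | 324       | 107         | 4                 | 4.0 | 4.5 | 8.867188 | 1        | 0.76            |
| 2   | 3          | 316       | 104         | 3                 | 3.0 | 3.5 | 8.000000 | 1        | 0.72            |
| 3   | 4          | 322       | 110         | 3                 | 3.5 | 2.5 | 8.671875 | 1        | 0.80            |
| 4   | 5          | 314       | 103         | 2                 | 2.0 | 3.0 | 8.210938 | 0        | 0.65            |
| ... | ...        | ...       | ...         | ...               | ... | ... | ...      | ...      | ...             |
| 495 | 496        | 332       | 108         | 5                 | 4.5 | 4.0 | 9.023438 | 1        | 0.87            |
| 496 | 497        | 337       | 117         | 5                 | 5.0 | 5.0 | 9.867188 | 1        | 0.96            |
| 497 | 498        | 330       | 120         | 5                 | 4.5 | 5.0 | 9.562500 | 1        | 0.93            |
| 498 | 499        | 312       | 103         | 4                 | 4.0 | 5.0 | 8.429688 | 0        | 0.73            |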


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   serial_no.         500 non-null    int64  
 1   gre_score          500 non-null    int64  
 2   toefl_score        500 non-null    int64  
 3   university_rating  500 non-null    int64  
 4   sop                500 non-null    float64
 5   lor                500 non-null    float64
 6   cgpa               500 non-null    float64
 7   research           500 non-null    int64  
 8   chance_of_admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


## Asserts and validations
`assert bool`
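
A minimal sketch of what such assertions could look like for this dataset, reusing the ranges from `range_constrains`. The `validate` helper and the three-row `sample` frame (copied from the first rows of the output above) are illustrative stand-ins, not the notebook's actual validation code.

```python
import pandas as pd

# Small stand-in for the notebook's df (first rows of the output above)
sample = pd.DataFrame({
    "gre_score": [337, 324, 316],
    "toefl_score": [118, 107, 104],
    "cgpa": [9.65, 8.87, 8.00],
    "research": [1, 1, 1],
    "chance_of_admit": [0.92, 0.76, 0.72],
})

def validate(frame: pd.DataFrame) -> None:
    """Raise AssertionError if a basic integrity rule is violated."""
    assert not frame.isna().any().any(), "unexpected missing values"
    assert not frame.duplicated().any(), "duplicate rows found"
    assert frame["gre_score"].between(0, 340).all(), "gre_score out of range"
    assert frame["toefl_score"].between(0, 120).all(), "toefl_score out of range"
    assert frame["cgpa"].between(0, 10).all(), "cgpa out of range"
    assert frame["research"].isin([0, 1]).all(), "research must be 0 or 1"
    assert frame["chance_of_admit"].between(0, 1).all(), "target outside [0, 1]"

validate(sample)  # passes silently when every rule holds
```

Each assertion carries a message, so a failure names the rule that was broken.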

# Train Test data splitting 
Separate the test data before visualisation to avoid data snooping bias.
* Set a random seed so that the train and test sets do not get mixed across subsequent runs.
* Use a hash of each row's identifier when training online or when new data is added, so that old test data never leaks into the new training data.
* If the dataset is not large enough (especially relative to the number of attributes), random sampling risks introducing a significant sampling bias; use stratified sampling instead:
	* Identify the primary predictor with the business.
	* Split the data into categories (strata) of that predictor such that no stratum is too small.
	* Split using sklearn.
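
The two splitting strategies above can be sketched as follows. The synthetic `data` frame, the choice of `cgpa` as the primary predictor, the band edges, and the `in_test_set` helper are all assumptions made for illustration; the real primary predictor should come from the business.

```python
import zlib

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for the admissions data; cgpa is assumed to be the primary predictor
data = pd.DataFrame({
    "serial_no": np.arange(1, 501),
    "cgpa": rng.uniform(6.8, 9.9, size=500).round(2),
})

# Stratified split: bin the predictor into strata, then sample each stratum proportionally
data["cgpa_band"] = pd.cut(data["cgpa"], bins=[0, 8.0, 8.5, 9.0, 10.0], labels=False)
train, test = train_test_split(
    data, test_size=0.2, stratify=data["cgpa_band"], random_state=42
)

# Hash-based split: a row's assignment depends only on its ID, so it never
# moves between train and test when new rows are appended later
def in_test_set(identifier: int, test_ratio: float = 0.2) -> bool:
    return zlib.crc32(str(identifier).encode()) / 2**32 < test_ratio

hash_test_mask = data["serial_no"].apply(in_test_set)
```

The hash-based mask only approximates the requested ratio, but it stays stable as the dataset grows, which a seeded random split does not guarantee.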

# Pipeline I : Stationary Pipeline
* Steps that can be applied to both the predictor and the target variables (separately)

# Handling Outliers and Missing Values
- Handling Outliers (to be done before missing value imputation, so that outliers do not distort the imputed values):
	- Remove outliers: In some cases, it may be appropriate to simply remove the observations that contain outliers. This can be particularly useful if you have a large number of observations and the outliers are not true representatives of the underlying population.
	- Transform outliers: The impact of outliers can be reduced or eliminated by transforming the feature. For example, a log transformation of a feature can reduce the skewness in the data, reducing the impact of outliers.
	- Impute outliers: In this case, outliers are simply considered as missing values. You can employ various imputation techniques for missing values, such as mean, median, mode, nearest neighbor, etc., to impute the values for outliers.
	- Use robust statistical methods: Some statistical methods are less sensitive to outliers and can provide more reliable results when outliers are present in the data. For example, the median and IQR are barely affected by the presence of outliers, so using them minimizes the impact of outliers on the analysis.
	- Use discretization or binning: Converting numerical variables to categorical form loses information, because the precise values within each bin are no longer distinguished. This can reduce the accuracy of an ML model, but it is useful for EDA.
	  Use the Freedman-Diaconis rule to choose the bin width (available in NumPy as `bins='fd'`, which seaborn's `histplot` also accepts for `bins`) [numpy implementation](https://medium.com/@maxmarkovvision/optimal-number-of-bins-for-histograms-3d7c48086fde) 
	- Example: Identify and remove salary values that are more than 3 standard deviations away from the mean. First check whether they all pertain to a specific role; if so, the role may be a useful feature rather than the values being outliers.
- Handling Missing Data: Impute using mean or mode with or without grouping by other categories, and check for patterns in missingness.
	- Example: Impute missing values in the 'age' column with the mean age grouped by 'gender'.
	- Check missingness per category.
	- Identify the missingness mechanism: MCAR, MAR, or MNAR.
	- For string columns, entries such as ' ' or 'unknown' are effectively missing values (less of an issue for category columns, where `value_counts()` exposes them).
	- Use sklearn imputers such as `SimpleImputer`, `KNNImputer` or `IterativeImputer` (open question: does this cause multicollinearity?)
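
One possible way to combine the outlier and missing-value steps above is sketched here: outliers are flagged with the 1.5 * IQR rule (a robust alternative to the 3-standard-deviation example), treated as missing, and then imputed with the median. The `salaries` frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical salaries with one extreme value and one missing entry
salaries = pd.DataFrame({"salary": [40.0, 42.0, 45.0, 47.0, 50.0, 52.0, 400.0, np.nan]})

# Robust outlier rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = salaries["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (salaries["salary"] < lower) | (salaries["salary"] > upper)

# Treat outliers as missing, then impute every gap with the median
salaries.loc[is_outlier, "salary"] = np.nan
imputer = SimpleImputer(strategy="median")
salaries["salary_imputed"] = imputer.fit_transform(salaries[["salary"]]).ravel()

# Freedman-Diaconis bin edges for plotting the cleaned column
bin_edges = np.histogram_bin_edges(salaries["salary_imputed"], bins="fd")
```

Handling the outlier first matters here: with the value 400 left in place, a mean-based imputation would be pulled far above the typical salary.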


# Data Transformation
* Add promising transformations of features (e.g., log(x), sqrt(x), x^2) to reduce skewness.
* Feature scaling: standardize or normalize features.
	* Use transformers inside a pipeline to do this.
* Encode categorical data:
	* `OrdinalEncoder()` => only if the categories are ordinal, like bad, avg, good
	* `OneHotEncoder()` => if the categories are not ordinal, or if the distances between ordinal categories are not equal. By default it outputs a SciPy sparse matrix, and it stores the learned categories in its `categories_` attribute (input column names are in `feature_names_in_`).
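
The transformation steps above could be wired together roughly like this. The `frame` columns and the transformer names (`num`, `ord`, `nom`) are hypothetical; only the scikit-learn classes themselves are real.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical frame, for illustration only
frame = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 400_000.0],
    "quality": ["bad", "avg", "good", "avg"],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

preprocess = ColumnTransformer([
    # log1p reduces right skew before standardizing
    ("num", Pipeline([
        ("log", FunctionTransformer(np.log1p)),
        ("scale", StandardScaler()),
    ]), ["income"]),
    # explicit category order: bad < avg < good
    ("ord", OrdinalEncoder(categories=[["bad", "avg", "good"]]), ["quality"]),
    # nominal categories get one column each
    ("nom", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

transformed = preprocess.fit_transform(frame)
# The result may be a SciPy sparse matrix; densify it for inspection
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed
learned_cities = preprocess.named_transformers_["nom"].categories_[0]
```

Keeping every step inside one `ColumnTransformer` means the same fitted statistics are reused on the test set, which avoids leaking test information into the scaling.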

# Pipeline II : Feature Engineering Pipeline

# Feature Engineering
- Generating New Features: Create new features like the difference between purchase date and delivery date, or segment numerical data into categorical bins. The model should not suffer from confounding variables.
  > this is an iterative process: once you get a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step
    - Drop the attributes that provide no useful information for the task.
    - New feature by changing the reference point
        - Example: Create a new feature 'delivery_time' as the difference between 'delivery_date' and 'purchase_date'.
    - Granularize and de-granularize values
        - Granularizing
            - GDP per capita from GDP
            - Estimate the average consumption of each district as state consumption / number of districts
        - De-granularizing
            - Sum the values of each district to get the state consumption; for easier interpretation, divide by the number of districts to get the average district spending for that state
            - AOV (average order value) for each customer group, which shows how much a customer purchases on average
    - Combine two variables to capture their relationship
        - Explore how x\*y or x/y behaves with https://www.desmos.com/
        - If x is age and y is income, then y/x rewards younger people with high income the most (inverse relation with age, direct relation with income)
        - With x\*y, a younger person with more income and an older person with less income can score the same; the feature grows with both x and y (direct relationship with each)
    - Discretize continuous features
        - Choose bins such that no bin has too few values
    - Decompose features (e.g., categorical, date/time).
        - Example: the day of the week, month, or season to capture patterns
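
Several of the feature ideas above are sketched below in pandas. The `orders` frame and the derived column names (`delivery_time`, `income_per_age`, `income_band`, `purchase_dow`, `purchase_month`) are hypothetical and not part of the admissions data.

```python
import pandas as pd

# Hypothetical order records, for illustration only
orders = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-12", "2024-03-20"]),
    "delivery_date": pd.to_datetime(["2024-01-08", "2024-02-11", "2024-03-19", "2024-03-22"]),
    "age": [25, 40, 60, 30],
    "income": [50_000, 80_000, 60_000, 30_000],
})

# Change of reference point: elapsed days instead of two absolute dates
orders["delivery_time"] = (orders["delivery_date"] - orders["purchase_date"]).dt.days

# Combine two variables: income per year of age rewards young, high earners
orders["income_per_age"] = orders["income"] / orders["age"]

# Discretize: quantile bins keep the bin counts balanced
orders["income_band"] = pd.qcut(orders["income"], q=2, labels=["low", "high"])

# Decompose a date into parts that may carry weekly or seasonal patterns
orders["purchase_dow"] = orders["purchase_date"].dt.dayofweek  # Monday = 0
orders["purchase_month"] = orders["purchase_date"].dt.month
```

Quantile-based bins (`pd.qcut`) are used here instead of fixed-width bins because they avoid nearly empty bins when the feature is skewed.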

# Visualisation and Exploration
* convert pipeline output to pd.DataFrame