# 1. Intro to the Dataset and the Aim of the EDA
<img src="/jamboree_logo.png" alt="jamboree logo banner" style="width: 800px;"/>

> Problem faced and why it is worth solving (ask stakeholders the right questions to make this clear)

Jamboree has helped thousands of students get into top colleges abroad. Whether for the GMAT, GRE, or SAT, its problem-solving methods aim for maximum scores with minimum effort.

The Jamboree team wants to know which factors matter most for a student's success in getting into an Ivy League college. They also want to know whether a predictive model can estimate the chance of admission to an Ivy League college from the given features.

**Dataset**

This dataset contains the details of 500 students who applied for admission to Ivy League colleges, along with each student's chance of admission.

**Aim:**
1. To analyze which factors are important for a student's success in getting into an Ivy League college.
2. To build a predictive model that estimates the chance of admission to an Ivy League college from the given features.

**Methods and Techniques used:** EDA, feature engineering, modeling using sklearn pipelines, hyperparameter tuning
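
As a rough illustration of what "modeling using sklearn pipelines" combined with "hyperparameter tuning" can look like, here is a minimal sketch. It is not the notebook's actual model: the data is synthetic, and the choice of `StandardScaler` plus `Ridge` and the `alpha` grid are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Jamboree dataset): 200 rows, 3 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 0.5 + 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.02, size=200)

# Pipeline: scale the features, then fit a ridge regression (illustrative model choice)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Hyperparameter tuning: 5-fold cross-validated grid search over the ridge penalty
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
cv_rmse = -search.best_score_  # the scorer returns negative RMSE, so flip the sign
print(f"best alpha: {best_alpha}, cross-validated RMSE: {cv_rmse:.4f}")
```

Keeping the scaler inside the pipeline means it is refit on each training fold during the search, so no information from the validation fold leaks into preprocessing.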

**Measure of Performance and Minimum Threshold to reach the business objective**: RMSE of 5% (0.05 on the 0 to 1 scale) between predicted and actual chance of admission
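
To make the metric concrete, the sketch below computes RMSE without any dependencies and compares it to the threshold. The `rmse` helper and `RMSE_THRESHOLD` are hypothetical names introduced here, the 5% threshold is read as 0.05 on the 0 to 1 scale (an assumption), and the predicted values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Actual values: 'Chance of Admit' for the first four students in the dataset.
# Predicted values: made up for illustration only (not model output).
actual = [0.92, 0.76, 0.72, 0.80]
predicted = [0.90, 0.80, 0.70, 0.75]

RMSE_THRESHOLD = 0.05  # assumption: "5%" read as 0.05 on the 0 to 1 scale
score = rmse(actual, predicted)
meets_objective = score <= RMSE_THRESHOLD
print(f"RMSE = {score:.4f}, meets objective: {meets_objective}")
```

Because RMSE squares each error before averaging, a few large misses raise the score more than many small ones.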

**Assumptions**
1. This fairly small dataset (500 entries) is representative of the real-world population.
2. The data is stable over time, i.e. the patterns it captures are assumed to hold for future applicants.

## 1.1 Library Setup

In [5]:
# Scientific libraries
import numpy as np
import pandas as pd

# Logging
import logging

# Visual libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Helper libraries
import urllib.request
from tqdm.notebook import tqdm, trange # Progress bar
#from colorama import Fore, Back, Style # coloured text in output
import warnings
#warnings.filterwarnings('ignore') # ignore all warnings

# Visual setup
%config InlineBackend.figure_format = 'retina' # sets the figure format to 'retina' for high-resolution displays.

# IPython display options
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # display the result of every expression in a cell, not only the last one

# Table styles
table_styles = {
    'cerulean_palette': [
        dict(selector="th", props=[("color", "#FFFFFF"), ("background", "#004D80"), ("text-transform", "capitalize")]),
        dict(selector="td", props=[("color", "#333333")]),
        dict(selector="table", props=[("font-family", 'Arial'), ("border-collapse", "collapse")]),
        dict(selector='tr:nth-child(even)', props=[('background', '#D3EEFF')]),
        dict(selector='tr:nth-child(odd)', props=[('background', '#FFFFFF')]),
        dict(selector="th", props=[("border", "1px solid #0070BA")]),
        dict(selector="td", props=[("border", "1px solid #0070BA")]),
        dict(selector="tr:hover", props=[("background", "#80D0FF")]),
        dict(selector="tr", props=[("transition", "background 0.5s ease")]),
        dict(selector="th:hover", props=[("font-size", "1.07rem")]),
        dict(selector="th", props=[("transition", "font-size 0.5s ease-in-out")]),
        dict(selector="td:hover", props=[('font-size', '1.07rem'),('font-weight', 'bold')]),
        dict(selector="td", props=[("transition", "font-size 0.5s ease-in-out")])
    ]
}

#from rich import print # color from print statement 
# Seed value for numpy.random => makes notebooks stable across runs
np.random.seed(42)

## 1.2 Read in the Data

In [6]:
class DataHandler:
    DEFAULT_RAW_DIR = '../data/raw' # directory that downloaded files are saved into

    def __init__(self, file_path : str = DEFAULT_RAW_DIR, url : str = None, output_path : str = 'data/processed'):
        # Exactly one source is allowed: a url (saved under the default raw directory) or a custom local file_path
        has_custom_path = file_path != self.DEFAULT_RAW_DIR
        if (url is None and not has_custom_path) or (url is not None and has_custom_path):
            raise ValueError('Specify exactly one of url or file_path')
        # With a url, the target file is <raw dir>/<last url segment>; otherwise use the given path as-is
        self.file_path = f"{file_path}/{url.split('/')[-1]}" if url is not None else file_path
        self.url = url
        self.output_path = output_path
    
    def download_data(self) -> None:
        if self.url is None:
            raise ValueError('No url was given, so there is nothing to download')
        logging.info(f'Downloading data from {self.url}')
        urllib.request.urlretrieve(self.url, self.file_path)

    def load_data(self) -> pd.DataFrame:
        logging.info(f'Ingesting data from {self.file_path}')
        #TODO add csv check
        return pd.read_csv(self.file_path)


data_handleUrl = DataHandler(url = 'https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv')

data_handleUrl.download_data()
df = data_handleUrl.load_data()
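
The `#TODO add csv check` in `load_data` could be addressed in several ways. One minimal option, sketched here with only the standard library, is to confirm that the file's header row contains the columns the analysis relies on before handing it to pandas. `check_csv_header` is a hypothetical helper, not part of `DataHandler`, and the temporary file exists only to make the example self-contained.

```python
import csv
import os
import tempfile


def check_csv_header(path, expected_columns):
    """Return the header row, or raise ValueError if an expected column is missing."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), None)
    if header is None:
        raise ValueError(f"{path} is empty")
    # Strip whitespace so that a header like 'LOR ' still matches 'LOR'
    found = {name.strip() for name in header}
    missing = [name for name in expected_columns if name not in found]
    if missing:
        raise ValueError(f"{path} is missing expected columns: {missing}")
    return header


# Self-contained demo on a tiny temporary file (not the real download)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("Serial No.,GRE Score,Chance of Admit\n1,337,0.92\n")

header = check_csv_header(tmp.name, ["GRE Score", "Chance of Admit"])
os.remove(tmp.name)
print(header)
```

Checking only the header is cheap and catches the common failure where a download returns an HTML error page instead of the CSV.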

In [7]:
df

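|     | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|-----|------------|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| 0   | 1          | 337       | 118         | 4                 | 4.5 | 4.5 | 9.65 | 1        | 0.92            |
| 1   | 2          | 324       | 107         | 4                 | 4.0 | 4.5 | 8.87 | 1        | 0.76            |
| 2   | 3          | 316       | 104         | 3                 | 3.0 | 3.5 | 8.00 | 1        | 0.72            |
| 3   | 4          | 322       | 110         | 3                 | 3.5 | 2.5 | 8.67 | 1        | 0.80            |
| 4   | 5          | 314       | 103         | 2                 | 2.0 | 3.0 | 8.21 | 0        | 0.65            |
| ... | ...        | ...       | ...         | ...               | ... | ... | ...  | ...      | ...             |
| 495 | 496        | 332       | 108         | 5                 | 4.5 | 4.0 | 9.02 | 1        | 0.87            |
| 496 | 497        | 337       | 117         | 5                 | 5.0 | 5.0 | 9.87 | 1        | 0.96            |
| 497 | 498        | 330       | 120         | 5                 | 4.5 | 5.0 | 9.56 | 1        | 0.93            |
| 498 | 499        | 312       | 103         | 4                 | 4.0 | 5.0 | 8.43 | 0        | 0.73            |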


# 2. Basic Exploration and Data Wrangling

## 2.1 Basic Exploration

In [8]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64  
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


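|       | Serial No. | GRE Score | TOEFL Score | University Rating | SOP      | LOR     | CGPA     | Research | Chance of Admit |
|-------|------------|-----------|-------------|-------------------|----------|---------|----------|----------|-----------------|
| count | 500.0      | 500.0     | 500.0       | 500.0             | 500.0    | 500.0   | 500.0    | 500.0    | 500.0           |
| mean  | 250.5      | 316.472   | 107.192     | 3.114             | 3.374    | 3.484   | 8.57644  | 0.56     | 0.72174         |
| std   | 144.481833 | 11.295148 | 6.081868    | 1.143512          | 0.991004 | 0.92545 | 0.604813 | 0.496884 | 0.14114         |
| min   | 1.0        | 290.0     | 92.0        | 1.0               | 1.0      | 1.0     | 6.8      | 0.0      | 0.34            |
| 25%   | 125.75     | 308.0     | 103.0       | 2.0               | 2.5      | 3.0     | 8.1275   | 0.0      | 0.63            |
| 50%   | 250.5      | 317.0     | 107.0       | 3.0               | 3.5      | 3.5     | 8.56     | 1.0      | 0.72            |
| 75%   | 375.25     | 325.0     | 112.0       | 4.0               | 4.0      | 4.0     | 9.04     | 1.0      | 0.82            |
| max   | 500.0      | 340.0     | 120.0       | 5.0               | 5.0      | 5.0     | 9.92     | 1.0      | 0.97            |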


**Understanding Features**

| Column              | Description                                   | Expected Data Type |
|---------------------|-----------------------------------------------|--------------------|
| `serial_no`         | Unique row ID                                 | int64              |
| `gre_score`         | GRE score (out of 340)                        | int64              |
| `toefl_score`       | TOEFL score (out of 120)                      | int64              |
| `university_rating` | University rating (out of 5)                  | category           |
| `sop`               | Statement of purpose strength (out of 5)      | category           |
| `lor`               | Letter of recommendation strength (out of 5)  | category           |
| `cgpa`              | Undergraduate CGPA (out of 10)                | float64            |
| `research`          | Research experience (either 0 or 1)           | category           |
| `chance_of_admit`   | Chance of admission (ranging from 0 to 1)     | float64            |
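
The table uses snake_case names, while the raw columns reported by `df.info()` are `Serial No.`, `GRE Score`, and so on, so a renaming step is implied somewhere in the wrangling. A minimal sketch of one way to derive those names follows; `to_snake_case` is a hypothetical helper and not necessarily how the notebook performs the rename.

```python
import re


def to_snake_case(name):
    """Convert a raw column label such as 'Serial No.' to 'serial_no'."""
    # Collapse every run of non-alphanumeric characters into one underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop leading/trailing underscores left by punctuation, then lowercase
    return cleaned.strip("_").lower()


raw_columns = [
    "Serial No.", "GRE Score", "TOEFL Score", "University Rating",
    "SOP", "LOR", "CGPA", "Research", "Chance of Admit",
]
renamed = {column: to_snake_case(column) for column in raw_columns}
print(renamed)
```

With pandas, the same function can be passed directly to `df.rename(columns=to_snake_case)`, since `rename` accepts a callable that is applied to each column label.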

Additional feature engineered columns:

| Column              | Description         | Expected Data Type |
|---------------------|---------------------|--------------------|
| `GRE`               | out of 340          | int64              |


## 2.2 Data Wrangling