## This is an outline for methods.

## I `highlighted` the parts that we need to discuss and agree on as a `TEAM`, the `DATA` parts that need to be handled in data preprocessing, and any general updates.
## FYI, I also used `@name` to refer to certain team members.

### 1.	Data Collection
### 2.	Feature Selection:
Identify the most relevant features from the dataset.\
Graduation Rate, Net Price, Acceptance Rate, Salary, Debt, and Size. 

### 3.	Data Preprocessing:
**`DATA:` these parts should be addressed via pre-processing so that data will be ready for modeling & UI with the right variables**\
`@Jacob @Nicolas, you can refer to this file to decide which variables you want to keep during preprocessing`

**`TEAM:` these parts should be addressed by the whole team to decide which approach to take**

#### Modeling
For modeling, we need to merge the MERGED & Recent Cohorts Institution files to make ranked_list.\
For modeling, we don't need normalization, since tree-based gradient boosting is insensitive to feature scale.\
For the model, we need the **selected_features**

**selected_features & Variable names:**
- Graduation Rate: C100_4 (Completion rate for first-time, full-time students at four-year institutions **(100% of expected time to completion)**) 
- Net price: NPT4_PUB, NPT4_PRIV (Average net price for public/private)
    - `DATA:` Merge these two columns into one; we ignore public/private
- Acceptance Rate: ADM_RATE 
- Salary: MD_EARN_WNE_1YR (Median earnings of graduates working and not enrolled 1 year after completing)
    - MDEARN_ALL (Overall median earnings of students working and not enrolled 10 years after entry)
    - `NEED TO:` decide how to handle the different time frames (# years after completion/entry/etc.)
- Debt: GRAD_DEBT_MDN_SUPP (Median Total Debt After Graduation for Loans Taken Out at This School) OR Cumulative loan debt at the @@th percentile
    - `NEED TO:` decide which variable to choose, or how to handle percentiles
- Size: GRADS (# of grad students), D_PCTPELL_PCTFLOAN (# of undergrad, denominator receiving pell grant)
    - `DATA:` A new column of total # of students (GRADS + D_PCTPELL_PCTFLOAN)
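
The two `DATA:` items above could be sketched in pandas roughly as follows. This is a minimal sketch on a made-up three-row frame. It assumes each institution has a value in at most one of `NPT4_PUB`/`NPT4_PRIV`, and the output column names `NET_PRICE` and `SIZE` are placeholders, not names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the merged institution data (values are made up).
df = pd.DataFrame({
    "NPT4_PUB": [12000.0, np.nan, np.nan],
    "NPT4_PRIV": [np.nan, 30000.0, np.nan],
    "GRADS": [500, 1200, 0],
    "D_PCTPELL_PCTFLOAN": [4000, 8000, 300],
})

# Net price: take the public value where present, otherwise the private one.
# NET_PRICE is a placeholder column name.
df["NET_PRICE"] = df["NPT4_PUB"].combine_first(df["NPT4_PRIV"])

# Size: GRADS + D_PCTPELL_PCTFLOAN, per the definition in this outline.
# SIZE is a placeholder column name.
df["SIZE"] = df["GRADS"] + df["D_PCTPELL_PCTFLOAN"]
```

Rows where both net price columns are missing stay missing in `NET_PRICE`, so we would still need to decide how to treat those schools.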

#### UI
`@Clara @Satish, please let me know any concerns or parts to edit/update`

For UI, we also need to use FoS and Recent FoS along with the ranked_list (for filtering Majors)\
We need annualized metrics that show increase/decrease over years (school_history)\
For UI, we need **filtering_options** (user input) and **school_history** (that shows increase/decrease of rates for each school).

**Filtering_options & Variable names:**
- Test scores (required): SATVR25, ACTCM25, etc.
    - `NEED TO:` decide how to handle percentiles
- Location (required): STABBR (abbreviation of state)
- Major (required): CIPDESC (in FoS and Recent FoS files); will not use detailed version of the majors; show dropdown and user can click on the display of majors
- `NEED TO:` Other than these 3 required filters, we should pick more optional filters (debt, net price, salary, public/private, women only, family income FAMINC, etc.)
    - We want to use as many as we can. We will finalize the list after exploring the data
    - `NEED TO:` All team members should explore the variables and share ideas; then we can settle on our final optional filters

**School_history & Variable names:**
- Admission rate: ADM_RATE  
- Net price: NPT4_PUB, NPT4_PRIV 
    - `DATA:` Merge these two columns into one; we ignore public/private (mentioned above)
- Debt
    - `NEED TO:` how to handle percentile..
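
As a rough illustration of the increase/decrease metric for school_history, the sketch below computes a year-over-year change in `ADM_RATE` per school. It assumes a long-format table with one row per school per year. The `UNITID` and `YEAR` column names and the output column `ADM_RATE_CHANGE` are assumptions for illustration, and the values are made up.

```python
import pandas as pd

# Toy long-format history: one row per school per year (values are made up).
history = pd.DataFrame({
    "UNITID": [1, 1, 1, 2, 2],
    "YEAR": [2019, 2020, 2021, 2020, 2021],
    "ADM_RATE": [0.50, 0.45, 0.54, 0.80, 0.80],
})

# Order rows so that diff() compares each year with the previous one.
history = history.sort_values(["UNITID", "YEAR"])

# Year-over-year change within each school; each school's first year is NaN.
history["ADM_RATE_CHANGE"] = history.groupby("UNITID")["ADM_RATE"].diff()
```

The same pattern would apply to the merged net price column and, once we pick a variable, to debt.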


### 4.	Data Analysis:
Use statistical analysis and data visualization techniques to gain insights into the relationships between different features and university quality.

### 5.	Score Calculation / Ranking:
Develop a scoring algorithm that combines the selected features into a single university score. Candidate methods include weighted averages and machine learning models; the right choice depends on our project's objectives and on how much weight we assign to each feature.

**Weighted Average:** \
Assign a weight to each feature, reflecting its importance, and calculate the weighted sum of the features for each university. For example, we might assign higher weights to Graduation Rate and Salary, indicating their greater importance in our ranking.

Here's an example using a weighted average approach:


In [None]:
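# Example: Weighted Average Score Calculation
import pandas as pd

# data: merged data from the 4 source files (pre-processed)
# Features used to compute the score
feature_cols = ["Graduation Rate", "Average Annual Cost", "Acceptance Rate", "Salary", "Debt", "Size"]
selected_features = data[feature_cols]  # selecting several columns needs a list (double brackets)

# Assign weights to each feature based on their importance
weights = pd.Series({
    "Graduation Rate": 0.2,
    "Average Annual Cost": 0.15,
    "Acceptance Rate": 0.1,
    "Salary": 0.2,
    "Debt": 0.15,
    "Distance": 0.1,  # not in feature_cols yet: add the column or drop this weight
    "Size": 0.1
})

# Keep only the weights for available features and rescale them to sum to 1
weights = weights.reindex(feature_cols)
weights = weights / weights.sum()

# Calculate the university score (weights align to columns by name)
# Caveat: this sums raw values, so large-valued features (e.g. Salary) dominate unless rescaled first
data['University Score'] = (selected_features * weights).sum(axis=1)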


We can adjust the weights to reflect our priorities. The total weight should sum to 1. This approach provides a single score for each university based on the weighted average of features.
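
One thing a plain weighted sum does not handle by itself is scale and direction: salary is in dollars while rates lie between 0 and 1, and for debt a lower value is the better outcome. The sketch below shows one possible way to deal with both, assuming min-max scaling and a hand-picked `lower_is_better` list. Both are choices for the team to confirm, not something decided in this outline, and the data values are made up.

```python
import pandas as pd

# Toy data (made-up values) with two of the selected features.
data = pd.DataFrame({
    "Graduation Rate": [0.9, 0.6, 0.75],
    "Debt": [10000.0, 30000.0, 20000.0],
})
weights = pd.Series({"Graduation Rate": 0.6, "Debt": 0.4})  # sums to 1
lower_is_better = ["Debt"]  # assumption: less debt should raise the score

# Min-max scale every feature to [0, 1] so units do not dominate the sum.
features = data[weights.index]
scaled = (features - features.min()) / (features.max() - features.min())

# Flip features where a smaller raw value is the better outcome.
scaled[lower_is_better] = 1 - scaled[lower_is_better]

# Weighted sum across columns; pandas aligns the weights by column name.
data["University Score"] = (scaled * weights).sum(axis=1)
```

With this setup every score falls between 0 and 1, which makes scores comparable across different weight choices.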


### 6.	Model:
We can build a predictive model to rank universities. Regression models, classification models, or ensemble methods can all be used for this; the most direct option is a regression model that predicts the university score. Candidates range from linear regression and decision trees to more advanced methods such as random forests or gradient boosting. In this approach, the model learns the relationship between the features and the score from historical data.


**Feature Engineering? (similar to 'score'):**\
Create any additional features or modify existing features that may improve the model's performance. For example, we can create new features like "Student-to-Faculty Ratio" or "Student Satisfaction Index" by combining existing data.

**Model - Gradient Boosting Regressors (ensemble method):** \
Ensemble methods are a powerful approach to improving the accuracy and robustness of regression models, and they often outperform individual models. \
Gradient boosting works by training multiple weak learners sequentially, each one correcting the errors of the ones before it, and combining their predictions. It is known for high predictive accuracy and can capture complex non-linear relationships between multiple features and university scores.

Other reasons to consider gradient boosting:
- Tree-based gradient boosting is fairly robust to outliers and noise in the features, though with a squared-error objective it remains sensitive to outliers in the target.
- It exposes hyperparameters (e.g. learning rate, number of leaves) that we can tune to optimize performance and control overfitting.
- Modern implementations are efficient and can handle large datasets.

**Algorithms for GB:** XGBoost, LightGBM, and CatBoost (the last is designed for categorical features)

**LightGBM** is known for its speed and efficiency, making it a good choice for large datasets.\
It uses a leaf-wise tree growth strategy, which can lead to faster training times.\
XGBoost and CatBoost are also efficient, but LightGBM is often faster to train.\
LightGBM is also designed to be memory-efficient: its histogram-based algorithm buckets continuous feature values into discrete bins, which reduces memory usage during training.

While LightGBM is a strong option, it's still good practice to experiment with different models and fine-tune hyperparameters to ensure that it performs well on our specific dataset. Given the size of our data (millions of rows), LightGBM's efficiency and speed make it a sensible choice.

**Train/test split:** Split our dataset into a training set and a testing set to train and evaluate our model (70-30 or **80-20**).


In [None]:
# pip install lightgbm
import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split


# data loaded into X and y
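# X contains features (independent variables): numeric columns only, no identifier columns
X = data[["Graduation Rate", "Average Annual Cost", "Acceptance Rate", "Salary", "Debt", "Size"]]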

# y contains target variable (university scores)
y = data['University Score']

# Split the data into training and testing sets (adjust the test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# You can also specify a random_state to ensure reproducibility by setting it to a fixed value.

# Now, X_train and y_train are your training data, and X_test and y_test are your testing data.


# Create a LightGBM dataset
train_data = lgb.Dataset(X_train, label=y_train)

# Define model parameters
params = {
    'objective': 'regression',  # Regression task
    'boosting_type': 'gbdt',    # Gradient Boosting Decision Tree
    'metric': ['l2', 'l1'],     # MSE (l2, mean squared error) and MAE (l1, mean absolute error)
    'num_leaves': 31,           # Maximum number of leaves in one tree **
    'learning_rate': 0.05,      # learning rate **
    'feature_fraction': 0.9,   # Percentage of features to use per tree **
    'bagging_fraction': 0.8,   # Percentage of data to bag per iteration **
    'bagging_freq': 5,         # Bagging frequency **
    'verbose': 0               # No output during training
}

# Train the model
num_boost_round = 100  # Number of boosting rounds (adjust this) **
lgb_model = lgb.train(params, train_data, num_boost_round=num_boost_round)


**Hyperparameter Tuning:** \
Optimize the model's hyperparameters to improve its performance. This may involve **grid search**, random search, or other hyperparameter optimization techniques.

Here's an example using grid search with 5-fold cross-validation:

In [None]:
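# Define the hyperparameter grid (scikit-learn API names:
# colsample_bytree = feature_fraction, subsample = bagging_fraction)
param_grid = {
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.05, 0.1, 0.2],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'subsample': [0.7, 0.8, 0.9]
}

# GridSearchCV needs a scikit-learn estimator, not the Booster returned by lgb.train
estimator = lgb.LGBMRegressor(
    objective='regression',
    n_estimators=100,
    subsample_freq=5    # bagging_freq; subsample only takes effect when this is > 0
)

# Create a Grid Search model (81 parameter combinations x 5 folds)
grid_search = GridSearchCV(estimator, param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the search on your training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and the model refit with them
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_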


### 7.	Model Evaluation:
Evaluate the model's performance using the testing dataset. Common regression evaluation metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2). The choice of metric depends on our project's goals. 

Calculate the Mean Absolute Error (MAE) and Mean Squared Error (MSE) to measure the average absolute and average squared differences between the predicted and actual scores. Lower values indicate a better model.\
RMSE is the square root of MSE and provides a more interpretable error metric in the same unit as our scores.\
R-squared measures the proportion of variance in the scores explained by the model. A higher R2 value indicates a better fit.
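
For reference, these are the standard definitions of the four metrics. The symbols are introduced here for illustration: $y_i$ is the actual score, $\hat{y}_i$ the predicted score, $\bar{y}$ the mean actual score, and $n$ the number of test rows.

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
$$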

**Cross-Validation:** \
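To ensure that our model generalizes well to new data, perform cross-validation, such as k-fold cross-validation or `Monte Carlo` cross-validation. K-fold divides the dataset into k subsets, trains and tests the model k times (each time holding out a different subset), and averages the performance metrics; Monte Carlo cross-validation instead repeats a random train/test split several times.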
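
The two cross-validation schemes can be compared with scikit-learn's `KFold` and `ShuffleSplit`. This is a minimal sketch on synthetic data, not project data. scikit-learn's `GradientBoostingRegressor` is used only as a stand-in so the example runs without LightGBM; a scikit-learn-compatible estimator such as `lgb.LGBMRegressor` could be passed in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Synthetic stand-in data: 200 rows, 4 features, noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.3, 0.1]) + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)

# k-fold: 5 disjoint folds, each used exactly once as the validation set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_mae = -cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_absolute_error")

# Monte Carlo: 5 independent random 80/20 splits (validation sets may overlap).
mc = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
mc_mae = -cross_val_score(model, X, y, cv=mc, scoring="neg_mean_absolute_error")

print(f"k-fold MAE: {kfold_mae.mean():.3f} +/- {kfold_mae.std():.3f}")
print(f"Monte Carlo MAE: {mc_mae.mean():.3f} +/- {mc_mae.std():.3f}")
```

The spread across folds gives a rough sense of how stable the error estimate is, which a single train/test split cannot show.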

Assess the model's bias-variance trade-off. A model with high bias underfits the data, while a model with high variance overfits. We want to strike a balance.

**Visualizations:** \
Create visualizations like scatter plots to compare predicted scores with actual scores, helping to identify areas where the model might be performing poorly.

**Hyperparameter Tuning - again (hoping this would not happen):** \
If the model performance is not satisfactory, we may need to revisit feature selection, feature engineering, and model selection or consider collecting additional data.

`If the results are good enough, we can add to the progress report in this part that, if GB doesn't behave well, we might consider running a new random forest model.`\
`-> mentioned below as well, but will think about this while writing the report to finalize`

Once we've built a model, we need to evaluate its performance. Here's how to do that:

In [None]:
import pandas as pd

# Make predictions on the test data
y_pred = lgb_model.predict(X_test)  # After tuning, predict with the best model from grid search instead

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")

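# Create a DataFrame to store the universities and their predicted scores
# (names are looked up in data by row index, since X should hold only numeric features)
ranking_df = pd.DataFrame({'University': data.loc[X_test.index, 'University'].values, 'Predicted_Score': y_pred})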

# Sort the DataFrame by predicted scores in descending order to get the ranked list
ranked_universities = ranking_df.sort_values(by='Predicted_Score', ascending=False)

# Display the ranked universities
print(ranked_universities)

Here, we calculate common regression metrics such as MAE, MSE, RMSE, and R2. These metrics help assess how well our model performs at predicting university scores. Lower MAE and RMSE values and higher R2 values indicate better model performance.

These are just simplified code snippets to give us an idea of the process. We'll likely need to fine-tune our model, use cross-validation, and conduct a more in-depth analysis of our data to ensure the accuracy of our university rankings. Additionally, we may want to create visualizations to communicate the results effectively.

**Post-ranking model:**\
Once the model is good to go, we will have a ranked list with millions of rows. Finally, we will filter the ranked list according to the user's filtering options.
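
The filtering step might look roughly like the sketch below, which applies two of the required filters (`STABBR` and `CIPDESC`) to a toy ranked list. The function name `filter_ranking`, its `top_n` parameter, and the data values are made up for illustration; `University` and `Predicted_Score` follow the column names used in the evaluation example.

```python
import pandas as pd

# Toy ranked list (made-up rows) carrying the two filter columns.
ranked_list = pd.DataFrame({
    "University": ["A", "B", "C", "D"],
    "STABBR": ["GA", "GA", "CA", "GA"],
    "CIPDESC": ["Computer Science", "Biology", "Computer Science", "Computer Science"],
    "Predicted_Score": [0.91, 0.88, 0.95, 0.70],
})

def filter_ranking(ranked, state, major, top_n=10):
    """Return the top_n schools matching the state and major filters."""
    mask = (ranked["STABBR"] == state) & (ranked["CIPDESC"] == major)
    return ranked[mask].sort_values("Predicted_Score", ascending=False).head(top_n)

top = filter_ranking(ranked_list, state="GA", major="Computer Science")
```

Optional filters could be added as extra keyword arguments that are skipped when the user leaves them empty.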

`For the UI outputs, we need a method that shows, for a single school (selected by the user from the top 10), its name and location, plus admission rate, net price, and debt over the years (median/mean).`
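
A possible shape for that method is sketched below. It assumes a long-format school_history table with one row per school per year. The function name `school_profile`, the `NET_PRICE` column (standing in for the merged net price), and the data values are placeholders. Debt is left out because its variable has not been chosen yet.

```python
import pandas as pd

# Toy history: one row per school per year (made-up values).
school_history = pd.DataFrame({
    "University": ["A", "A", "A", "B"],
    "STABBR": ["GA", "GA", "GA", "CA"],
    "ADM_RATE": [0.50, 0.40, 0.60, 0.90],
    "NET_PRICE": [10000.0, 12000.0, 14000.0, 20000.0],
})

def school_profile(history, name, metrics=("ADM_RATE", "NET_PRICE")):
    """Summarize one school: name, location, and median/mean of each metric over years."""
    rows = history[history["University"] == name]
    summary = rows[list(metrics)].agg(["median", "mean"])
    return {"name": name, "location": rows["STABBR"].iloc[0], "summary": summary}

profile = school_profile(school_history, "A")
```

`profile["summary"]` is a small table with one row per statistic and one column per metric, which the UI could render directly.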

#### `Update`
`actually how about having baseline models along with ensemble methods in the beginning and then compare to pick the best??` \
`-> will think about this more while writing report (mentioned above)`
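
If we go with the baseline comparison, it could be set up along these lines. This is a sketch on synthetic data, not project data. The three candidates are scikit-learn models chosen for illustration, with `GradientBoostingRegressor` standing in for LightGBM, and MAE on a single 80-20 split is just one possible selection criterion.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a mildly non-linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] * 0.6 + X[:, 1] ** 2 * 0.3 + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each candidate on the same split and record its test MAE.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))

# Pick the candidate with the lowest test error.
best_name = min(results, key=results.get)
print(results, "-> best:", best_name)
```

Keeping every model on the same split and metric makes the comparison fair. Swapping the single split for cross-validation would make the choice less dependent on one random partition.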

### 8.	Visualization:
Create visual representations of the university scores and rankings. This can include bar charts, scatter plots, and interactive dashboards for better data communication.

### 9.	User Interface:
If our goal is to make the results accessible to a wider audience, consider building a user-friendly interface or a website where users can input their preferences and see university rankings based on their criteria.