## Title: College Admissions Data Mining Analysis Report

## Collaborators: Tarik Kose, Dawood Chaudry, Tapasya Sharma, Harsh Sharma

#### Date: 12/08/2021
---

#### Abstract
---
*(Briefly summarize the project including the problem, data sets, and final findings.)* 

Relationships among college statistics are a popular subject. What patterns exist among tuition, tuition increases, the geographic location of institutions, standardized test scores, and enrollment size? How does one factor influence another? We want to find out whether we can predict certain information about educational institutions from other available information. If a new institution is being considered, what can be expected of its statistics? This project examines several of these factors and applies various data mining techniques to answer those questions.
A single data set, found on Kaggle, was the inspiration for this project. We later split that data set into smaller, more specific subsets to suit the needs of each question.

We found that although our questions may be answerable, our limited resources and experience kept us from building close-to-perfect models. With more time, guidance, fine-tuning, and resources, the answers to these problems would be very interesting to pursue.


### 1. Introduction
---
*(Introduce the project and describe the objectives. This section will provide an overview of the entire project including the description of the data sets and the specific data mining methods and techniques the team used.)* 

This project focused on a large dataset of college admissions information. There are over 1,000 colleges listed (one per row) and over 100 columns of information. We used many of the columns to construct the models you will observe below. 

We tried to answer 3 different questions with this project. How can we predict if a university is a public or private university through columns such as geographic region, and level of institution? How can we predict the total cost of enrollment (tuition and living) of universities through columns such as admission rate and enrollment rate? How can we use SAT scores data to classify universities within a given geographic region?

We utilized KNN classification and regression, decision trees, random forests, vectorization, data processing and feature engineering, cross-validation, and ensemble learning models. Although we used different variations of code and samples for each problem, our project as a whole has a story to tell.
It is important to keep in mind that our data science experience is limited and that we could only apply the material we had learned. We researched every process we used and referred to StackOverflow on multiple occasions for troublesome errors. We ended up excluding various pieces of code that did not work and had to cut down the scope of our project due to time constraints, but the present deliverable satisfies the requirements of this project. There is always room for improvement. 


### 2. Problem Definition
---
*(Define the problem that will be solved in this data project.)*

### 1) Can we predict if a university is a public or private university through columns such as geographic region, level of institution, etc.?
Every school in the dataset is either public or private, and one of our objectives is to predict that status. To do that, we first removed data that was irrelevant and concentrated on the data we could use to surface interesting facts and make predictions. We removed the first column (the university names) because, from a data mining perspective, a name carries little information about the rest of the data. We also wanted to predict whether a school is public or private from the other columns, and the name alone is sometimes enough to give the answer away.
The columns we are going to focus on are as follows:
1.	State
2.	Geographic Region
3.	Enrollment
4.	Carnegie Classification
5.	Urbanization

We will use this data to put together decision trees and random forest classifiers. For the decision trees, we want to know “Is the school public or private?”. We will give it training data and then evaluate predictions. We have 50 states, so first we will look at how the state impacts the likelihood of a school being public vs. private. For example, if there are 100 schools in Arizona and 98 of them are public, then the likelihood of the test school being public is 98%. If public schools are likely to have more enrollment than private schools, we will find the median value and use that for a branch. Each demographic listed above will likely have an impact on the school’s status.
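The state-level reasoning above can be sketched in a few lines of Python. This is only an illustration: the data is a made-up list of (state, sector) pairs that mirrors the Arizona example rather than the real dataset, and the helper name `public_share_by_state` is hypothetical.

```python
from collections import Counter

def public_share_by_state(schools):
    """Return {state: fraction of that state's schools that are public}."""
    totals = Counter(state for state, _ in schools)
    publics = Counter(state for state, sector in schools if sector == "Public")
    return {state: publics[state] / totals[state] for state in totals}

# Toy data (not from the dataset): 98 public and 2 private schools in Arizona
toy = [("Arizona", "Public")] * 98 + [("Arizona", "Private")] * 2
shares = public_share_by_state(toy)
print(shares["Arizona"])  # 0.98
```

A decision tree does something similar internally when it evaluates a split on a categorical column: it looks at the class proportions on each side of the split.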

### 2) Can we predict the total cost of enrollment (tuition and living) of universities through columns such as admission rate, enrollment rate, etc.?
By creating a linear regression model and KNN regressor, we plan to predict the total cost of enrollment for universities. The columns that we are considering for our analysis include: 
1.	Admission Rate  (Calculated by the Applicants total column and the Admissions total column)
2.	Enrollment Rate (Calculated by the Enrolled total column and the Admissions total column)
3.	SAT Percentile Scores 
4.	ACT Percentile Scores
5.	Percentage Increase in tuition from past years 
6.	State the university is in
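As a minimal sketch of how the first two features could be derived, the snippet below computes both rates for a single school. The numbers are the Applicants total, Admissions total, and Enrolled total values for Alabama A & M University as they appear in the `df.head()` output in this report; the helper name `rate_percent` is hypothetical, and missing values are assumed to be passed as `None`.

```python
def rate_percent(numerator, denominator):
    """Return numerator / denominator as a percentage, or None if undefined."""
    if numerator is None or denominator in (None, 0):
        return None
    return 100.0 * numerator / denominator

# Values for Alabama A & M University from the df.head() output
applicants, admitted, enrolled = 6142.0, 5521.0, 1104.0

admission_rate = rate_percent(admitted, applicants)   # admitted / applicants
enrollment_rate = rate_percent(enrolled, admitted)    # enrolled / admitted
print(round(admission_rate), round(enrollment_rate))  # 90 20
```

The rounded results agree with the "Percent admitted - total" (90.0) and "Admissions yield - total" (20.0) values that the dataset already provides for that row.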

This data is especially helpful for prospective students who are concerned about the financial aspect of their college admissions process. Understanding how much money a school may charge students and which geographic regions are more expensive in their yearly tuition fees can help applicants make more informed decisions about their top schools.

### 3) Can we use SAT scores data to classify universities within a given geographic region?
For the training data, we currently have the SAT scores for each university (math, reading, and writing sections) and its geographic region. Based on this data, we attempt to find a correlation and use KNN, decision tree, random forest, cross-validation, and ensemble learning models to classify these scores by geographic region. For each geographic region, we will use feature engineering to create new features that may be more meaningful for prediction. For classification, one such new column might record which SAT section has the highest score in a given geographic region. For instance, it might be that the Eastern region has higher math SAT scores than reading and writing scores. Using this information and other columns, such as the number of applicants, admitted students, and enrolled students, we can try to classify which of the geographic regions in the data a given SAT score from an unknown university belongs to. 

This can later be evaluated using techniques such as cross validation, precision, and recall, along with the information gain of a decision tree that works through the columns specified above to assign a given SAT score to a geographic region. Furthermore, when decision trees are involved, we can use a random forest model, which builds many decision trees and combines their predictions by voting to classify the test data. Cross validation will be used to hold out a portion of the data for testing, separate from the training data. 
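To make the cross-validation idea concrete, here is a small sketch of how k-fold splitting divides row indices into training and testing portions. It is written in plain Python for illustration and is not the code used in the linked notebooks; the function name `k_fold_indices` is hypothetical, and the folds are contiguous with no shuffling, which a real run would normally add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("test rows:", test_idx)
```

Each row lands in the test portion exactly once, so every model score is computed on rows the model did not see during training.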

This is important to analyze because it would reveal, to some extent, whether patterns and correlations exist in the data, in the spirit of association rule mining. With this information, universities in a given geographic region can see what changes they might make to their study programs to improve their students' skills, based on the SAT scores. For example, if math is a weaker area across a given geographic region, together with other correlating parameters, then universities there could add more foundational math courses to improve their students’ experience.




## Initial Expected Results

Using the decision trees and the random forest classifiers, we expect to be able to predict a school’s status (public vs. private) without knowing which school it is. Through the linear regression model and the KNN regressor model, we hope to predict the tuition fees of universities, and we aim for an initial accuracy of at least 70% across all four models. We also plan to evaluate the performance of the four models and identify the one that best answers the questions we have asked of the data.  

### 3. Data Sources
---
*(Describe the origin of the data sources. Load the data into appropriate format.)*

The data set is taken from https://www.kaggle.com/samsonqian/college-admissions

This dataset is titled *“College Admissions: Admission/Class Demographics by University.”* It was assembled by Samson Qian and uploaded on November 26, 2018, onto Kaggle. It is a CSV file.

The dataset has 108 columns and 1534 rows of data. Each row (after the header row) corresponds to a university. Some names, such as Cambridge College and Ottawa University, may suggest institutions outside the United States, but the rows shown in this report all carry a US state and geographic region (for example, Ottawa University-Online is listed under Kansas). Each column (except for the first one, which holds the university names) corresponds to some piece of information about the college. 108 columns is a large amount of information, but a quick overview of the data shows that there is information on:

- Applications
- Admissions
- Graduate and undergraduate enrollment
- Exam scores (ACT, SAT)
- Graduation rate
- Tuition (for different years)
- Total cost for in-state & out-of-state students
- Location (state, geographic area)
- Level of Institution (2-year, 4-year)
- Public or Private
- Demographics (such as the size of city, enrollment ethnicity, gender, first-generation students)
- Financial aid
- Endowment

It is important to note that there are cells with no information (NA/blank). This can likely be attributed to a lack of public information regarding that data. Otherwise, the dataset looks complete and is very detailed. 


In [None]:
from google.colab import files
import pandas as pd

In [None]:
### For this code, please upload the Data-Table 1.csv dataset from the folder.
files.upload()


In [None]:
#importing data
df = pd.read_csv(r'Data-Table 1.csv')

### 4. Exploring and Visualizing Data
---
*(Explore the data by analyzing its statistics and visualizing the values of features and correlations between different features.)*

In [None]:
#run if the latest version of pandas profiling doesn't already exist for you
#remember to restart runtime after upgrading
#necessary to upgrade because older versions have deprecated elements
!pip install --upgrade pandas_profiling



In [None]:
import pandas_profiling 

In [None]:
df.head()

Unnamed: 0,Name,Applicants total,Admissions total,Enrolled total,Percent of freshmen submitting SAT scores,Percent of freshmen submitting ACT scores,SAT Critical Reading 25th percentile score,SAT Critical Reading 75th percentile score,SAT Math 25th percentile score,SAT Math 75th percentile score,SAT Writing 25th percentile score,SAT Writing 75th percentile score,ACT Composite 25th percentile score,ACT Composite 75th percentile score,"Estimated enrollment, total","Estimated enrollment, full time","Estimated enrollment, part time","Estimated undergraduate enrollment, total","Estimated undergraduate enrollment, full time","Estimated undergraduate enrollment, part time","Estimated freshman undergraduate enrollment, total","Estimated freshman enrollment, full time","Estimated freshman enrollment, part time","Estimated graduate enrollment, total","Estimated graduate enrollment, full time","Estimated graduate enrollment, part time",Number of students receiving an Associate's degree,Percent admitted - total,Admissions yield - total,"Tuition and fees, 2010-11","Tuition and fees, 2011-12","Tuition and fees, 2012-13","Tuition and fees, 2013-14",Total price for in-state students living on campus 2013-14,Total price for out-of-state students living on campus 2013-14,State abbreviation,FIPS state code,Geographic region,Sector of institution,Level of institution,...,Percent of undergraduate enrollment that are White,Percent of undergraduate enrollment that are two or more races,Percent of undergraduate enrollment that are Race/ethnicity unknown,Percent of undergraduate enrollment that are Nonresident Alien,Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander,Percent of undergraduate enrollment that are women,Percent of graduate enrollment that are American Indian or Alaska Native,Percent of graduate enrollment that are Asian,Percent of graduate enrollment that are Black or African American,Percent of graduate enrollment that are 
Hispanic/Latino,Percent of graduate enrollment that are Native Hawaiian or Other Pacific Islander,Percent of graduate enrollment that are White,Percent of graduate enrollment that are two or more races,Percent of graduate enrollment that are Race/ethnicity unknown,Percent of graduate enrollment that are Nonresident Alien,Percent of graduate enrollment that are Asian/Native Hawaiian/Pacific Islander,Percent of graduate enrollment that are women,Number of first-time undergraduates - in-state,Percent of first-time undergraduates - in-state,Number of first-time undergraduates - out-of-state,Percent of first-time undergraduates - out-of-state,Number of first-time undergraduates - foreign countries,Percent of first-time undergraduates - foreign countries,Number of first-time undergraduates - residence unknown,Percent of first-time undergraduates - residence unknown,"Graduation rate - Bachelor degree within 4 years, total","Graduation rate - Bachelor degree within 5 years, total","Graduation rate - Bachelor degree within 6 years, total",Percent of freshmen receiving any financial aid,"Percent of freshmen receiving federal, state, local or institutional grant aid",Percent of freshmen receiving federal grant aid,Percent of freshmen receiving Pell grants,Percent of freshmen receiving other federal grant aid,Percent of freshmen receiving state/local grant aid,Percent of freshmen receiving institutional grant aid,Percent of freshmen receiving student loan aid,Percent of freshmen receiving federal student loans,Percent of freshmen receiving other loan aid,Endowment assets (year end) per FTE enrollment (GASB),Endowment assets (year end) per FTE enrollment (FASB)
0,Alabama A & M University,6142.0,5521.0,1104.0,15.0,88.0,370.0,450.0,350.0,450.0,,,15.0,19.0,5024.0,4442.0,582.0,4055.0,3802.0,253.0,1104.0,1089.0,15.0,969.0,640.0,329.0,0.0,90.0,20.0,5800.0,6828.0,7182.0,7182.0,21849.0,27441.0,Alabama,Alabama,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Public, 4-year or above",Four or more years,...,3.0,0.0,1.0,0.0,0.0,51.0,0.0,2.0,81.0,1.0,0.0,15.0,0.0,1.0,0.0,2.0,69.0,,,,,,,,,10.0,23.0,29.0,97.0,89.0,81.0,81.0,7.0,1.0,32.0,89.0,89.0,1.0,,
1,University of Alabama at Birmingham,5689.0,4934.0,1773.0,6.0,93.0,520.0,640.0,520.0,650.0,,,22.0,28.0,18568.0,11961.0,6607.0,11502.0,8357.0,3145.0,1773.0,1732.0,41.0,7066.0,3604.0,3462.0,0.0,87.0,36.0,5806.0,6264.0,6798.0,7206.0,22495.0,31687.0,Alabama,Alabama,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Public, 4-year or above",Four or more years,...,60.0,3.0,1.0,2.0,5.0,58.0,0.0,4.0,14.0,3.0,0.0,70.0,2.0,1.0,6.0,4.0,64.0,1529.0,86.0,224.0,13.0,19.0,1.0,1.0,0.0,29.0,46.0,53.0,90.0,79.0,36.0,36.0,10.0,0.0,60.0,56.0,55.0,5.0,24136.0,
2,Amridge University,,,,,,,,,,,,,,626.0,326.0,300.0,313.0,202.0,111.0,6.0,3.0,3.0,313.0,124.0,189.0,5.0,,,8360.0,8720.0,6800.0,6870.0,,,Alabama,Alabama,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Private not-for-profit, 4-year or above",Four or more years,...,29.0,0.0,27.0,0.0,1.0,61.0,0.0,0.0,37.0,1.0,0.0,32.0,0.0,29.0,0.0,0.0,55.0,,,,,,,,,0.0,0.0,67.0,100.0,90.0,90.0,90.0,0.0,40.0,90.0,100.0,100.0,0.0,,302.0
3,University of Alabama in Huntsville,2054.0,1656.0,651.0,34.0,94.0,510.0,640.0,510.0,650.0,,,23.0,29.0,7376.0,4802.0,2574.0,5696.0,4237.0,1459.0,651.0,638.0,13.0,1680.0,565.0,1115.0,0.0,81.0,39.0,7492.0,8094.0,8794.0,9192.0,23466.0,35780.0,Alabama,Alabama,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Public, 4-year or above",Four or more years,...,70.0,2.0,3.0,4.0,4.0,44.0,1.0,4.0,7.0,2.0,0.0,69.0,1.0,3.0,14.0,4.0,43.0,514.0,79.0,92.0,14.0,27.0,4.0,18.0,3.0,16.0,37.0,48.0,87.0,77.0,31.0,31.0,4.0,1.0,63.0,46.0,46.0,3.0,11502.0,
4,Alabama State University,10245.0,5251.0,1479.0,18.0,87.0,380.0,480.0,370.0,480.0,,,15.0,19.0,6076.0,5183.0,893.0,5357.0,4873.0,484.0,1542.0,1517.0,25.0,719.0,310.0,409.0,0.0,51.0,28.0,7164.0,8082.0,7932.0,8720.0,18286.0,25222.0,Alabama,Alabama,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Public, 4-year or above",Four or more years,...,2.0,1.0,1.0,2.0,0.0,59.0,1.0,1.0,77.0,1.0,0.0,17.0,1.0,1.0,1.0,1.0,71.0,903.0,58.0,571.0,37.0,67.0,4.0,4.0,0.0,9.0,19.0,25.0,93.0,87.0,76.0,76.0,13.0,11.0,34.0,81.0,81.0,0.0,13202.0,


In [None]:
df.tail()

Unnamed: 0,Name,Applicants total,Admissions total,Enrolled total,Percent of freshmen submitting SAT scores,Percent of freshmen submitting ACT scores,SAT Critical Reading 25th percentile score,SAT Critical Reading 75th percentile score,SAT Math 25th percentile score,SAT Math 75th percentile score,SAT Writing 25th percentile score,SAT Writing 75th percentile score,ACT Composite 25th percentile score,ACT Composite 75th percentile score,"Estimated enrollment, total","Estimated enrollment, full time","Estimated enrollment, part time","Estimated undergraduate enrollment, total","Estimated undergraduate enrollment, full time","Estimated undergraduate enrollment, part time","Estimated freshman undergraduate enrollment, total","Estimated freshman enrollment, full time","Estimated freshman enrollment, part time","Estimated graduate enrollment, total","Estimated graduate enrollment, full time","Estimated graduate enrollment, part time",Number of students receiving an Associate's degree,Percent admitted - total,Admissions yield - total,"Tuition and fees, 2010-11","Tuition and fees, 2011-12","Tuition and fees, 2012-13","Tuition and fees, 2013-14",Total price for in-state students living on campus 2013-14,Total price for out-of-state students living on campus 2013-14,State abbreviation,FIPS state code,Geographic region,Sector of institution,Level of institution,...,Percent of undergraduate enrollment that are White,Percent of undergraduate enrollment that are two or more races,Percent of undergraduate enrollment that are Race/ethnicity unknown,Percent of undergraduate enrollment that are Nonresident Alien,Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander,Percent of undergraduate enrollment that are women,Percent of graduate enrollment that are American Indian or Alaska Native,Percent of graduate enrollment that are Asian,Percent of graduate enrollment that are Black or African American,Percent of graduate enrollment that are 
Hispanic/Latino,Percent of graduate enrollment that are Native Hawaiian or Other Pacific Islander,Percent of graduate enrollment that are White,Percent of graduate enrollment that are two or more races,Percent of graduate enrollment that are Race/ethnicity unknown,Percent of graduate enrollment that are Nonresident Alien,Percent of graduate enrollment that are Asian/Native Hawaiian/Pacific Islander,Percent of graduate enrollment that are women,Number of first-time undergraduates - in-state,Percent of first-time undergraduates - in-state,Number of first-time undergraduates - out-of-state,Percent of first-time undergraduates - out-of-state,Number of first-time undergraduates - foreign countries,Percent of first-time undergraduates - foreign countries,Number of first-time undergraduates - residence unknown,Percent of first-time undergraduates - residence unknown,"Graduation rate - Bachelor degree within 4 years, total","Graduation rate - Bachelor degree within 5 years, total","Graduation rate - Bachelor degree within 6 years, total",Percent of freshmen receiving any financial aid,"Percent of freshmen receiving federal, state, local or institutional grant aid",Percent of freshmen receiving federal grant aid,Percent of freshmen receiving Pell grants,Percent of freshmen receiving other federal grant aid,Percent of freshmen receiving state/local grant aid,Percent of freshmen receiving institutional grant aid,Percent of freshmen receiving student loan aid,Percent of freshmen receiving federal student loans,Percent of freshmen receiving other loan aid,Endowment assets (year end) per FTE enrollment (GASB),Endowment assets (year end) per FTE enrollment (FASB)
1529,University of South Florida-Sarasota-Manatee,393.0,136.0,86.0,79.0,59.0,502.0,580.0,490.0,570.0,490.0,570.0,22.0,25.0,1889.0,873.0,1016.0,1739.0,835.0,904.0,86.0,82.0,4.0,150.0,38.0,112.0,0.0,35.0,63.0,,,,5587.0,,,Florida,Florida,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Public, 4-year or above",Four or more years,...,73.0,2.0,2.0,1.0,3.0,59.0,1.0,2.0,5.0,9.0,1.0,74.0,1.0,1.0,4.0,3.0,65.0,78.0,94.0,2.0,2.0,2.0,2.0,1.0,1.0,,,,,,,,,,,,,,4422.0,
1530,The King’s College,3033.0,2158.0,127.0,57.0,45.0,540.0,630.0,510.0,600.0,520.0,640.0,24.0,28.0,504.0,489.0,15.0,504.0,489.0,15.0,127.0,125.0,2.0,,,,0.0,71.0,6.0,27350.0,29240.0,29240.0,31300.0,48717.0,48717.0,New York,New York,Mid East DE DC MD NJ NY PA,"Private not-for-profit, 4-year or above",Four or more years,...,78.0,5.0,0.0,3.0,3.0,60.0,,,,,,,,,,,,,,,,,,,,57.0,61.0,61.0,100.0,100.0,34.0,34.0,10.0,4.0,100.0,57.0,56.0,12.0,,935.0
1531,Ottawa University-Online,,,,,,,,,,,,,,445.0,2.0,443.0,379.0,1.0,378.0,11.0,0.0,11.0,66.0,1.0,65.0,0.0,,,,,,,,,Kansas,Kansas,Plains IA KS MN MO NE ND SD,"Private not-for-profit, 4-year or above",Four or more years,...,60.0,0.0,10.0,0.0,3.0,61.0,12.0,3.0,11.0,4.0,0.0,65.0,0.0,5.0,0.0,3.0,60.0,,,,,,,,,,,,,,,,,,,,,,,20863.0
1532,Providence Christian College,122.0,65.0,20.0,,,,,,,,,,,68.0,68.0,0.0,68.0,68.0,0.0,20.0,20.0,0.0,0.0,0.0,0.0,0.0,53.0,31.0,20444.0,21444.0,22686.0,24222.0,38602.0,38602.0,California,California,Far West AK CA HI NV OR WA,"Private not-for-profit, 4-year or above",Four or more years,...,79.0,3.0,0.0,5.0,6.0,52.0,,,,,,,,,,,,,,,,,,,,46.0,54.0,54.0,100.0,100.0,50.0,50.0,14.0,0.0,100.0,64.0,64.0,14.0,,350.0
1533,Polytechnic University of Puerto Rico-Orlando,,,,,,,,,,,,,,128.0,33.0,95.0,83.0,16.0,67.0,1.0,1.0,0.0,45.0,17.0,28.0,0.0,,,10920.0,10920.0,10920.0,10920.0,,,Florida,Florida,Southeast AL AR FL GA KY LA MS NC SC TN VA WV,"Private not-for-profit, 4-year or above",Four or more years,...,2.0,0.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,98.0,0.0,0.0,0.0,2.0,0.0,0.0,33.0,,,,,,,,,,,,100.0,100.0,100.0,100.0,0.0,50.0,0.0,50.0,50.0,0.0,,


In [None]:
df.dtypes

Name                                                      object
Applicants total                                         float64
Admissions total                                         float64
Enrolled total                                           float64
Percent of freshmen submitting SAT scores                float64
                                                          ...   
Percent of freshmen receiving student loan aid           float64
Percent of freshmen receiving federal student loans      float64
Percent of freshmen receiving other loan aid             float64
Endowment assets (year end) per FTE enrollment (GASB)    float64
Endowment assets (year end) per FTE enrollment (FASB)    float64
Length: 108, dtype: object

In [None]:
# Profile only the first 10 rows and first 10 columns to keep the report small
profile = pandas_profiling.ProfileReport(df.iloc[:10, :10], explorative=True)

In [None]:
#visualizing a profile report which showcases general exploratory data analytics
profile




### 5. Modeling and Evaluation
---
*(Feature engineering and selection. Build data mining models. Compare and evaluate the performance of the models.)*

Instead of building numerous models to predict a single target variable, we decided as a group to pose questions and build models to answer them. The questions are:

#### 1. How can we predict if a university is a public or private university through columns such as geographic region, level of institution, etc.?

For the model, please refer to this link: https://colab.research.google.com/drive/1UBP1jTM7Ie4ZH4DPWomHELQpewXgswAT?usp=sharing

This is an extra part: https://colab.research.google.com/drive/13g62G_dSGkRm0PAuyZqj48KB0lq6ZJpZ?usp=sharing

#### 2. How can we predict the total cost of enrollment (tuition and living) of universities through columns such as admission rate, enrollment rate, etc.?

For the model, please refer to this link: https://colab.research.google.com/drive/1ZKF2dzBAID8c2Cqz3eHPLQCmtPPEKgAR?usp=sharing

#### 3. How can we use SAT scores data to classify universities within a given geographic region?

For the model, please refer to this link: https://colab.research.google.com/drive/1jU0k_xkdkzJzF8aAJlFN-3soN7ojolP0?usp=sharing


### 6. Conclusion
---
*(Briefly describe what you have done and what you discovered. Discuss any shortcomings of the process and results. **Finally, discuss the lessons learned from doing the project**.)*

The lessons from the project are organized by question, since each question addressed a different aspect of the dataset:

### Question 1

This question focused on multiple variables that could indicate whether an institution is public or private. We looked at the state and region the institution is in, the Carnegie Classification of the institution, and the degree of urbanization of the institution's location.

We looked at the precision, recall, and F1-measure of knn_bow_binary, knn_bow_count, knn_tfidf, dt_bow_binary, dt_bow_count, dt_tfidf, rf_bow_binary, rf_bow_count, and rf_tfidf for each of the mentioned variables. We found that KNN had the highest precision across the board, except for a single instance in "Carnegie v. Target" where the recall of knn_tfidf exceeded its precision. 

Recall and F1-measure were similar in most tests, but the F1-measure was consistently the lowest of the three scores. For context, precision is the fraction of positive predictions that actually belong to the positive class, while recall is the fraction of all actual positive examples in the data that were predicted as positive.
The F1-measure combines precision and recall into a single score, so it takes both false positives and false negatives into account.
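The three metrics can be computed by hand, as in the following sketch. The labels are toy values made up for illustration, not results from our models, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive="Public"):
    """Compute precision, recall, and F1 for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels (not model output): 2 true positives, 1 false positive, 2 false negatives
y_true = ["Public", "Public", "Public", "Public", "Private", "Private"]
y_pred = ["Public", "Public", "Private", "Private", "Public", "Private"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

For a single class, F1 is the harmonic mean of precision and recall, so it falls between the two. Scores that are averaged across several classes need not follow that ordering.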

Although our test is inconclusive, there appears to be some correlation between the urbanization of a city and the prevalence of public/private schools, little correlation between state and the prevalence of public/private schools, some correlation between the Carnegie Classification and whether a school is public or private, and little to no correlation between region and whether a school is public or private. 

We struggled a bit with this problem because we couldn't decide which columns we wanted to use. Some columns gave us errors, and Colab has bugs that required us to refresh the page in order to resolve them. We did a lot of troubleshooting, since we needed to adapt our code to the dataset at hand. 

The biggest lesson we took from this project was that data science is deeply enjoyable and rewarding. Although we faced obstacles in completing the models and performing the testing, we produced a deliverable we are proud of and will add it to our portfolios when we apply for data science positions.

### Question 2 

This question focused on identifying a relationship between factors such as Admission Rate, Enrollment Rate, SAT Percentile Scores, ACT Percentile Scores, and Percentage Increase in tuition from past years, and on determining which model predicts this percent increase better. Here are the conclusions we reached after analyzing the data with both models:

<ol>

<li> In the KNN model, we tested numerous k values using GridSearchCV and found that a k value of 7 gives the best result, with the lowest mean squared error (MSE). </li>

<li> In the Polynomial Regression model, we tested degrees up to 4 and found that a degree of 2 gives the best result, with the lowest mean squared error. Calculating degrees greater than 4 was not possible because the computation produced very large results and required more RAM than we had available. </li>
<li> As per MSE, the KNN model does a better job at prediction, with a lower MSE score than the polynomial model. </li>
<li> The KNN model also had more consistent results than the Polynomial model. </li>

<li> Through this project, we were able to work with a KNN Regressor model, whereas previously we had only worked with the KNN Classifier model. This helped us explore regression methods and metrics. </li>

<li> This was also our first time creating a polynomial regression model, which further helped us understand the usefulness of the regression space. </li>
</ol>
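The idea behind choosing k by lowest MSE can be sketched in plain Python. This is a simplified illustration, not the notebook code: it uses a single held-out validation split instead of the cross-validated search that GridSearchCV performs, a single numeric feature, and toy numbers. The function names `knn_predict`, `mse`, and `best_k` are hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Predict y for a 1-D feature x as the mean of the k nearest training targets."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def best_k(train_x, train_y, val_x, val_y, candidates):
    """Return the candidate k with the lowest validation MSE, plus all scores."""
    scores = {}
    for k in candidates:
        preds = [knn_predict(train_x, train_y, x, k) for x in val_x]
        scores[k] = mse(val_y, preds)
    return min(scores, key=scores.get), scores

# Toy data (not from the dataset)
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
val_x, val_y = [2.5, 4.5], [5.0, 9.0]

k, scores = best_k(train_x, train_y, val_x, val_y, candidates=[1, 2, 3])
print(k, scores)
```

On this toy data, k = 2 averages the two neighbours on either side of each validation point and reproduces the targets exactly, so it gets the lowest MSE. The same loop over polynomial degrees instead of k values would mirror the degree search described above.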


### Question 3
This question focused on how we can use the SAT data to find correlations between different features and predict a categorical feature. Specifically, for the first part (where we used KNN to predict the geographic region), we looked at how the SAT scores vary across geographic regions. The initial scatter plot showed a lot of variance and noisy data, which suggested, even before we built the model, that it would be prone to underfitting or overfitting. The only columns selected for this data frame were the SAT scores and the geographic region. 

As a result, it is possible that these values are not correlated strongly enough to support prediction. In fact, if we check this theory against the correlation matrix at the beginning of the report (profile section), there is very little direct correlation between the SAT scores and the geographic region. The scores are spread out across different ranges within each geographic region, so the KNN model only reached an accuracy of about 50%, which is very low. In future work, more features could be included in the training data to improve the KNN model and avoid the lack of separation shown in the scatter plot. 

The second part of the question was more interesting. Here we used the following columns: 'SAT Critical Reading 25th percentile score', 'SAT Critical Reading 75th percentile score', 'SAT Math 25th percentile score', 'SAT Math 75th percentile score', 'SAT Writing 25th percentile score', 'SAT Writing 75th percentile score', 'Percent admitted - total', 'Tuition and fees, 2010-11', 'Tuition and fees, 2011-12', 'Tuition and fees, 2012-13', 'Tuition and fees, 2013-14', 'Geographic region', 'Sector of institution', 'Degree of urbanization (Urban-centric locale)', 'Percent of first-time undergraduates - out-of-state', 'Percent of first-time undergraduates - foreign countries', 'Graduation rate - Bachelor degree within 4 years, total'. The question we were trying to answer with this set of features was how the graduation rate of a college depends on the geographic region, the average tuition fee across the years, the SAT scores, and the percentages of first-time undergraduates. 

This data frame consisted of a mix of continuous and categorical data, which allowed us to use both KNN and decision tree classifiers. With both models, we were trying to see how graduation rates are affected. All the columns in this data frame were selected because they were highly correlated with the graduation rate feature, which allowed for a much more meaningful model. Compared with the first part of the question, the accuracy of the KNN model rose considerably, from about 50% to 90%, because of the better choice of features and the additional data, which gave the model more instances to learn from. It is quite possible that an accuracy of 90% reflects overfitting in the KNN model, but this model is still far superior to the one in the first part. 

5-fold cross validation also allowed the models to be trained and evaluated on different parts of the training data when estimating accuracy. The performance of the decision tree classifier also changed slightly when we moved to a random forest, because the random forest algorithm randomly selects a subset of features for each tree. More importantly, a random forest builds multiple trees and combines their predictions by voting, which gives a larger and more varied set of trees to work with. 
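The voting step can be illustrated with a short sketch. The per-tree predictions below are made up for illustration rather than taken from our random forest, and the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree class predictions for one sample into a single label."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one university
votes = ["Southeast", "Plains", "Southeast", "Far West", "Southeast"]
print(majority_vote(votes))  # Southeast
```

Because each tree sees a different random subset of rows and features, individual trees can disagree, and the vote smooths out their individual errors.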

### 7. References

1. Stack Overflow
2. https://machinelearningknowledge.ai/knn-classifier-in-sklearn-using-gridsearchcv-with-example/
3. https://medium.com/sanrusha-consultancy/k-nearest-neighbor-knn-regression-and-fun-behind-it-7055cf50ae56
4. https://www.section.io/engineering-education/polynomial-regression-in-python/#step-4-training-the-polynomial-regression-model-on-the-whole-dataset
5. https://medium.com/swlh/linear-regression-simple-multiple-and-polynomial-4741a5e13eb6