# Project - College Admission
### Nan Zhao

## Objective of the Proposed Research

The faculty in the Data Science Program at the University of the Pacific plans to evaluate the applicants for next year's cohort. To find the 5 most fit candidates, in order of priority and with their most salient characteristics, the faculty needs to review every candidate's application form.  
The objective of this research is to help the faculty identify the qualified candidates and rank the 5 most fit among them, listing each one's most salient characteristics.  
To achieve this objective, I'll read the Criteria for Admission and gather the prerequisite entry requirements. Then I'll loop through the Candidates.csv file to identify the candidates qualified for admission. After that, I'll prioritize the candidates based on their educational background and how well their skills match the criteria for admission.  
  
I have 15 txt files containing the candidates' application forms. Each application form contains the candidate's name, skills, education, and passion. Some application forms have errors such as stray spaces, blank lines, or missing content.  
I have one csv file called Candidates.csv that lists the 15 candidates, giving each candidate's name and the name of the text file containing his/her form.   
I have one txt file called Criteria.txt that contains the Criteria for Admission: the prerequisite entry requirements and the detailed admission priorities.   


## Explanation of Data Sources

I created the text files for the candidates' application forms from scratch, based on the educational backgrounds and prerequisite programming skills relevant to data science.  
Each application form contains the candidate's name, skills, education, and passion. The candidates' names are nicknames. Their skills are drawn from 'C', 'C++', 'Java', 'JavaScript', 'Python', 'R', 'ML', 'SQL', 'Excel', and 'NoSQL'. Their passions differ and indicate each candidate's intended career direction.  
The prerequisite entry requirements of the Data Science Program are a Bachelor's degree, Python, and SQL. Qualified candidates are then prioritized by their educational background and by their programming skills in C and C++.  

## Explanation of the Steps of the Algorithm

The objective of this research is to help the faculty find the qualified candidates and the 5 most fit candidates, in order of priority and with their most salient characteristics.  
To achieve this objective, I first identify the candidates qualified for admission and then prioritize them based on their educational background and their skills.  
First, I read the Criteria for Admission and gather the prerequisite entry requirements.  
Second, I loop through the Candidates.csv file to identify all qualified candidates: those who have at least a Bachelor's degree as well as the programming skills Python and SQL.  
Third, I use the priorities() function to prioritize the candidates based on their educational background and their programming skills in C and C++.  
Finally, I create a text file called ToAccept.csv that contains the 5 most fit candidates in order of priority.

## Cleanup performed in Python with Explanation

I have 15 txt files containing the candidates' application forms. Each application form contains the candidate's name, skills, education, and passion.  
To filter qualified candidates precisely, I skip the line containing each candidate's passion, using the join() and find() methods. Some forms do not include a passion, but this does not affect the results.  
Some application forms have errors such as stray spaces, blank lines, or missing content. For example, Jennifer's form has some blank lines, but they do not prevent Python from looping through the txt file and do not affect the results.   
Jennifer's form also has stray spaces: '  Skills  : Python'. To identify her skills, the join() method joins the list of fields into a single string, and the find() method then determines whether 'Python' is a substring of that string.  
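
The join() and find() cleanup described above can be tried on a single row. This is a minimal sketch: the row is copied from the printed output of Jennifer's form further down, and the variable names are only illustrative.

```python
# Skills row of Jennifer's form, as csv.reader returns it (copied from the printed output).
row = ['  Skills  : Python', 'C++', 'Java', 'JavaScript', 'C', 'R', 'ML', 'SQL', 'Excel', 'NoSQL']

# join() turns the list of fields into one string, so the stray spaces no longer matter.
line = "".join(row)

# find() returns -1 when the substring is absent.
has_python = line.find('Python') != -1
has_sql = line.find('SQL') != -1
is_passion_line = line.find('Passion') != -1 or line.find('passion') != -1

print(has_python, has_sql, is_passion_line)  # True True False
```

One caveat of substring matching: 'NoSQL' contains 'SQL', so a form listing only NoSQL would also pass the find('SQL') test.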

### 1. Read the Criteria for Admission and gather the prerequisite entry requirements.

In [1]:
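# Print Criteria.txt line by line; the with block closes the file afterwards.
with open('Criteria.txt') as criteria_file:
    for line in criteria_file:
        print(line.rstrip('\n'))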

Data Science Program Overview
The MS in Data Science prepares graduates for careers in data analytics and related fields.  This is a science (as opposed to business) based program that is focused on developing students’ math foundation in statistics and linear algebra, and computer programming to prepare them for coursework in topics like machine learning, time series analysis, customer analytics, and data visualization.

This 32-unit, 4-semester degree culminates in a Capstone Project, in which students work on an analytics problem with a corporation in the Silicon Valley/Northern California region.

Prerequisite entry requirements include:
Educational Requirements: A Bachelors degree
Skills Requirements: Python, SQL

Priorities:
1. MS in Computer Science, MS in Data Science, MS in Applied Mathematics, MS in Applied Statistics, MS in Informatics
2. BS in Computer Science, BS in Data Science, BS in Applied Mathematics, BS in Applied Statistics, BS in Informatics
3. C
4. C++
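
The cell above only prints Criteria.txt. As a rough sketch of how the prerequisite entry requirements could also be gathered programmatically, the snippet below parses the 'X Requirements: ...' lines. It assumes the line format shown in the output above; parse_requirements is a hypothetical helper that is not part of the notebook, and the text is embedded as a string so that the example runs on its own.

```python
# Excerpt of Criteria.txt, copied from the output above.
criteria_text = """Prerequisite entry requirements include:
Educational Requirements: A Bachelors degree
Skills Requirements: Python, SQL
"""

def parse_requirements(text):
    """Return {label: [values]} for every line of the form 'X Requirements: a, b'."""
    requirements = {}
    for line in text.splitlines():
        label, sep, value = line.partition(':')
        # Keep only lines whose label ends with 'Requirements' (assumed convention).
        if sep and label.strip().endswith('Requirements'):
            requirements[label.strip()] = [v.strip() for v in value.split(',')]
    return requirements

parsed = parse_requirements(criteria_text)
print(parsed)
# {'Educational Requirements': ['A Bachelors degree'], 'Skills Requirements': ['Python', 'SQL']}
```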



### 2. Loop through the Candidates.csv file to identify all qualified candidates. 

In [2]:
import numpy as np
import pandas as pd
import csv

In [3]:
df = pd.read_csv('Candidates.csv',header=None,names=['Name','File Name'],skiprows=3)
df.head()

| | Name | File Name |
|---|---|---|
| 0 | Jennifer | 1.txt |
| 1 | Michelle | 2.txt |
| 2 | Kathy | 3.txt |
| 3 | Martha | 4.txt |
| 4 | Sara | 5.txt |


In [4]:
# Print the first three application forms
# Some application forms have errors like spaces, blank lines, or missing contents.

def print_forms(filename):
    # csv.reader splits each line of the form on commas and yields [] for blank lines.
    with open(filename, 'r') as candidate_file:
        candidate_array = csv.reader(candidate_file)
        for row in candidate_array:
            print(row)

filename = df.iloc[:,1]
for i in filename[:3]:
    print_forms(i)

[]
['Name: Jennifer']
[]
[]
['  Skills  : Python', 'C++', 'Java', 'JavaScript', 'C', 'R', 'ML', 'SQL', 'Excel', 'NoSQL']
['MS in Computer Science']
['Passion: Sentiment Analysis and Opinion Mining']
['Michelle']
['C', 'C++', 'Java', 'JavaScript', 'Python', 'SQL', 'ML']
['MS in Computer Science']
['Passion: Consumer Analytics']
['Name: Kathy']
['Skills: Python', 'C', 'C++', 'Java', 'JavaScript', 'ML', 'SQL']
['BS in Data Science']
['Passion: Time Series Analysis']
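
The printed forms show the kinds of errors mentioned earlier: blank lines, stray spaces, and missing labels. The notebook works around them with substring matching; as an alternative illustration, the sketch below normalizes the rows first. The form text is reconstructed from the printed output of Jennifer's form (an assumption about the exact file content), and clean_rows is a hypothetical helper.

```python
import csv
import io

# Jennifer's form, reconstructed from the rows printed above (assumed file content).
form_text = (
    "\n"
    "Name: Jennifer\n"
    "\n"
    "\n"
    "  Skills  : Python,C++,Java,JavaScript,C,R,ML,SQL,Excel,NoSQL\n"
    "MS in Computer Science\n"
    "Passion: Sentiment Analysis and Opinion Mining\n"
)

def clean_rows(text):
    """Drop blank lines and strip surrounding whitespace from every field."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:  # csv.reader yields [] for blank lines
            rows.append(cells)
    return rows

rows = clean_rows(form_text)
for row in rows:
    print(row)
```

Stripping only removes leading and trailing whitespace, so the spaces inside 'Skills  : Python' remain; the substring test with find() is still needed for that field.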


In [5]:
# Based on the Criteria for Admission, the requirements() function checks the prerequisite entry requirements.
# The prerequisite entry requirements of the Data Science Program are a Bachelor's degree and the skills Python and SQL.

In [6]:
def requirements(filename):
    candidate = []
    with open(filename, 'r') as candidate_file:
        candidate_array = csv.reader(candidate_file)
        for row in candidate_array:
            row = "".join(row)   # join the csv fields into one string so stray spaces do not matter
            # Cleanup: skip the passion line so that it cannot produce false matches.
            if (row.find('Passion') != -1) or (row.find('passion') != -1):
                continue
            # find() returns -1 when the substring is absent, so this tests for both required skills.
            if (row.find('Python') != -1) and (row.find('SQL') != -1):
                for row in candidate_array:
                    row = "".join(row)     
                    if (row.find('Passion') != -1) or (row.find('passion') != -1):
                        continue
                    if (row.find('BS') != -1) or (row.find('MS') != -1) or (row.find('PhD') != -1):  
                        candidate.append(candidate_file.name)
                    
    return candidate

In [7]:
# The list of acceptable_candidates contains all qualified candidates.

In [8]:
acceptable_candidates=[]

for i in range(df.shape[0]):
    acceptable_candidates.append(requirements(filename[i]))
while [] in acceptable_candidates:
    acceptable_candidates.remove([])  #remove empty lists
    
acceptable_candidates

[['1.txt'], ['2.txt'], ['3.txt'], ['4.txt'], ['5.txt'], ['9.txt']]
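
Note that requirements() looks for the degree only on the lines that follow the skills line, so it relies on the forms listing skills before education. A possible order-independent variant is sketched below. It is not the notebook's method: is_qualified is a hypothetical helper, it takes the form as a list of text lines, and the degree test on 'BS in', 'MS in', and 'PhD' is an assumption about how degrees are written.

```python
def is_qualified(lines):
    """Check the prerequisites on a form given as a list of text lines."""
    has_skills = False
    has_degree = False
    for line in lines:
        if 'passion' in line.lower():
            continue  # skip the passion line, as requirements() does
        if 'Python' in line and 'SQL' in line:
            has_skills = True
        if any(tag in line for tag in ('BS in', 'MS in', 'PhD')):
            has_degree = True
    return has_skills and has_degree

# Kathy's form, taken from the printed output above.
kathy = ['Name: Kathy', 'Skills: Python,C,C++,Java,JavaScript,ML,SQL',
         'BS in Data Science', 'Passion: Time Series Analysis']
degree_first = [kathy[0], kathy[2], kathy[1], kathy[3]]  # same form, education before skills
no_sql = ['Name: Example', 'Skills: Python,R', 'BS in Informatics']  # hypothetical form

print(is_qualified(kathy), is_qualified(degree_first), is_qualified(no_sql))  # True True False
```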

### 3. Prioritize the candidates based on their educational background and computer programming skills of C and C++.

In [9]:
# The priorities() function prioritizes the candidates based on their educational background and their programming skills in C and C++.

In [10]:
def priorities(filename):
    first=[]
    second=[]
    third=[]
    forth=[]
    with open(filename, 'r') as candidate_file:
        candidate_array = csv.reader(candidate_file)
        for row in candidate_array:
            if ('MS in Computer Science' in row) or ('MS in Data Science' in row) or ('MS in Applied Mathematics' in row) or ('MS in Applied Statistics' in row)  or ('MS in Informatics' in row):
                first.append(candidate_file.name)
            if ('BS in Computer Science' in row) or ('BS in Data Science' in row) or ('BS in Applied Mathematics' in row) or ('BS in Applied Statistics' in row)  or ('BS in Informatics' in row):
                second.append(candidate_file.name)
            if ('C' in row):
                third.append(candidate_file.name)
            if ('C++' in row):
                forth.append(candidate_file.name)
    return first, second, third, forth

In [11]:
first_l=[]
second_l=[]
third_l=[]
forth_l=[]
final=[]

for i in range(len(acceptable_candidates)):
    first, second, third, forth = priorities("".join(acceptable_candidates[i]))    
    first_l.append(first)    
    second_l.append(second)
    third_l.append(third)
    forth_l.append(forth)

In [12]:
final = first_l + second_l + third_l + forth_l

In [13]:
while [] in final:
    final.remove([])

In [14]:
# The list of final_l contains all qualified candidates in order of priority.

In [15]:
final_l = []
for i in final:
    if i not in final_l:
        final_l.append(i)
final_l

[['1.txt'], ['2.txt'], ['4.txt'], ['9.txt'], ['3.txt'], ['5.txt']]
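
The concatenate-then-deduplicate steps above place each candidate at the first priority tier they match and keep the original order within a tier. The same ordering can be sketched with a stable sort on a tier number. This is an illustration rather than the notebook's code: priority_rank is a hypothetical helper, and the three records are taken from the forms printed earlier.

```python
FIELDS = ['Computer Science', 'Data Science', 'Applied Mathematics',
          'Applied Statistics', 'Informatics']
MS_DEGREES = {'MS in ' + field for field in FIELDS}
BS_DEGREES = {'BS in ' + field for field in FIELDS}

def priority_rank(education, skills):
    """Return the index of the first priority tier matched (lower is better)."""
    if education in MS_DEGREES:
        return 0
    if education in BS_DEGREES:
        return 1
    if 'C' in skills:
        return 2
    if 'C++' in skills:
        return 3
    return 4  # matches no tier; the notebook's lists would drop such a candidate

# (file name, education, skills), from the first three printed forms.
candidates = [
    ('3.txt', 'BS in Data Science', ['Python', 'C', 'C++', 'Java', 'JavaScript', 'ML', 'SQL']),
    ('1.txt', 'MS in Computer Science', ['Python', 'C++', 'Java', 'JavaScript', 'C', 'R', 'ML', 'SQL', 'Excel', 'NoSQL']),
    ('2.txt', 'MS in Computer Science', ['C', 'C++', 'Java', 'JavaScript', 'Python', 'SQL', 'ML']),
]

# sorted() is stable, so candidates in the same tier keep their input order.
ranked = sorted(candidates, key=lambda c: priority_rank(c[1], c[2]))
print([c[0] for c in ranked])  # ['1.txt', '2.txt', '3.txt']
```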

### 4. Create a text file called ToAccept.csv that contains a list of the 5 most fit candidates in order of priority.

In [16]:
# final_list contains the 5 most fit candidates in order of priority.

In [17]:
final_list=final_l[:5]
final_list

[['1.txt'], ['2.txt'], ['4.txt'], ['9.txt'], ['3.txt']]

In [18]:
name=[]
for i in final_list:
    name.append(df[df['File Name'].isin(i)].iloc[:,0])
name = np.array(name).tolist()
name

[['Jennifer'], ['Michelle'], ['Martha'], ['Julia'], ['Kathy']]

In [19]:
edu = ['MS in Computer Science', 'MS in Data Science', 'MS in Applied Mathematics', 'MS in Applied Statistics', 'MS in Informatics', 'BS in Computer Science', 'BS in Data Science', 'BS in Applied Mathematics', 'BS in Applied Statistics', 'BS in Informatics']

In [20]:
# The print_forms() function filters and returns a candidate's most salient characteristics.

In [21]:
def print_forms(filename):
    """Return the candidate's salient characteristics: C, C++, and a prioritized degree."""
    l = []
    with open(filename, 'r') as candidate_file:
        candidate_array = csv.reader(candidate_file)
        for row in candidate_array:
            # A skill counts only when it appears as a separate field of the row.
            if 'C' in row:
                l.append('C')
            if 'C++' in row:
                l.append('C++')
            # Record the degree if it is one of the prioritized degrees in edu.
            for i in edu:
                if i in row:
                    l.append(i)
    return l

In [22]:
l1=name[0] + print_forms(final_list[0][0])
l2=name[1] + print_forms(final_list[1][0])
l3=name[2] + print_forms(final_list[2][0])
l4=name[3] + print_forms(final_list[3][0])
l5=name[4] + print_forms(final_list[4][0])

In [23]:
# Move the education entry (appended last) to position 1, right after the name.
for row in (l1, l2, l3, l4, l5):
    row.insert(1, row.pop())

l=[l1,l2,l3,l4,l5]
l

[['Jennifer', 'MS in Computer Science', 'C', 'C++'],
 ['Michelle', 'MS in Computer Science', 'C', 'C++'],
 ['Martha', 'MS in Computer Science', 'C', 'C++'],
 ['Julia', 'MS in Data Science'],
 ['Kathy', 'BS in Data Science', 'C', 'C++']]
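
The rows above are assembled in two passes: the characteristics are collected first and the education entry is then moved next to the name. A more direct construction might look like the sketch below. salient_row is a hypothetical helper, Kathy's data comes from her printed form, and Julia's skill list is made up for the example because her form is not shown.

```python
def salient_row(name, education, skills):
    """Build [name, education, 'C'?, 'C++'?] in the column order used for ToAccept.csv."""
    return [name, education] + [skill for skill in ('C', 'C++') if skill in skills]

kathy_row = salient_row('Kathy', 'BS in Data Science',
                        ['Python', 'C', 'C++', 'Java', 'JavaScript', 'ML', 'SQL'])
# Julia's skills are an assumption; only her name and degree appear in the output above.
julia_row = salient_row('Julia', 'MS in Data Science', ['Python', 'SQL'])

print(kathy_row)  # ['Kathy', 'BS in Data Science', 'C', 'C++']
print(julia_row)  # ['Julia', 'MS in Data Science']
```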

In [24]:
# Create the text file called ToAccept.csv that contains a list of the 5 most fit candidates in order of priority with their most salient characteristics.

In [25]:
# newline='' lets the csv module control line endings; the with block closes the file.
with open("ToAccept.csv", "w", newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(l)

In [26]:
# Read the file ToAccept.csv

In [27]:
with open("ToAccept.csv", "r", newline='') as csvfile:
    l = csv.reader(csvfile)
    for row in l:
        print(row)

['Jennifer', 'MS in Computer Science', 'C', 'C++']
['Michelle', 'MS in Computer Science', 'C', 'C++']
['Martha', 'MS in Computer Science', 'C', 'C++']
['Julia', 'MS in Data Science']
['Kathy', 'BS in Data Science', 'C', 'C++']
