Requirements
 - Input the data
 - Create a new column in the Part 1 Output table for each student’s initials. If a student has a double-barrelled second name, take only the first letter of its first part
 - e.g. “NERTY CHERRY HOLME” becomes NC, not NCH.
 - Find a way to join this table to the Additional Information table. We should maintain exactly 4,000 unique records. 
 - Develop a ranking system to rank each student by Grade Score within their specified Subject Selection and Region. Every combination of Subject Selection and Region should have their own ranking and remember that if students have a matching grade score, we then prioritise those who live closer to the school as a “tie-breaker”. 
 - For each Subject, find and flag the top 20 students with the caveat that this year within each course, 15 students must be from the East and 5 from the West given our newly imposed 75%/25% split
 - Delete all rejected students, leaving only the 100 accepted students.
 - Remove unnecessary fields
 - Find the total number of accepted applicants per secondary school and represent this as a percentage of the total spaces that were available for that region. 
 - Hint: think about how many were allowed per course, and how many courses there are
 - For each region, label their highest performing school as “High Performing” and the lowest performing school as “Low Performing” in a new column named “School Status”.
 - Give all other schools the status “Average Performing”
 - Delete any unwanted fields and rearrange to give the output shown below
 - Output the data
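
The last requirements (percentage of regional spaces per school, then School Status) can be sketched on toy data as follows. This is only an illustration under stated assumptions: the school names and counts are hypothetical, the course count of 5 is derived from 100 accepted students at 20 per course, and "performance" is assumed to mean the percentage of regional spaces a school filled.

```python
import pandas as pd

# Hypothetical accepted students, one row each (school names are made up).
accepted = pd.DataFrame({
    "school_name": ["A", "A", "A", "B", "B", "C", "D", "D", "E"],
    "region": ["EAST"] * 6 + ["WEST"] * 3,
})

PER_COURSE = {"EAST": 15, "WEST": 5}  # places per course, from the 75%/25% split
N_COURSES = 5                         # assumption: 100 accepted / 20 per course

# Count accepted students per school, keeping the region alongside.
per_school = (
    accepted.groupby(["region", "school_name"]).size().reset_index(name="accepted")
)

# Total spaces in a region = places per course * number of courses.
per_school["region_spaces"] = per_school["region"].map(PER_COURSE) * N_COURSES
per_school["pct_of_spaces"] = 100 * per_school["accepted"] / per_school["region_spaces"]

# Label the best and worst school in each region; everything else is average.
by_region = per_school.groupby("region")["pct_of_spaces"]
is_high = per_school["pct_of_spaces"] == by_region.transform("max")
is_low = per_school["pct_of_spaces"] == by_region.transform("min")
per_school["School Status"] = "Average Performing"
per_school.loc[is_high, "School Status"] = "High Performing"
per_school.loc[is_low, "School Status"] = "Low Performing"

print(per_school)
```

Ties are not resolved in this sketch: schools sharing the maximum or minimum percentage all receive the same label, and a region with a single school ends up "Low Performing" because that assignment runs last.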


In [81]:
import os
import pandas as pd

In [82]:
#  - Input the data
lu_df=pd.read_csv("Additional Info Lookup.csv")
student_df=pd.read_csv("Part 1 Output.csv")

In [83]:
lu_df.columns=lu_df.columns.str.lower().str.strip().str.replace(' ','_')
student_df.columns=student_df.columns.str.lower().str.strip().str.replace(' ','_')


In [84]:
#  - Create a new column in the Part 1 Output table for each student’s initials. If a student has a double barreled second name then only take the first letter from the first part
#  - e.g. “NERTY CHERRY HOLME” becomes NC not NCH. 

# Split each name once, then join the first letters of the first two tokens
name_parts=student_df.full_name.str.split()
student_df['initials']=name_parts.str[0].str[0]+name_parts.str[1].str[0]
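
A quick check of the initials rule on toy data, as a minimal sketch. The second name is made up; the first is the example from the requirements.

```python
import pandas as pd

# One name from the requirements plus one hypothetical two-part name.
names = pd.DataFrame({"full_name": ["NERTY CHERRY HOLME", "ADA LOVELACE"]})

# Only the first two whitespace-separated tokens contribute a letter,
# so the trailing "HOLME" is ignored.
parts = names["full_name"].str.split()
names["initials"] = parts.str[0].str[0] + parts.str[1].str[0]

print(names["initials"].tolist())  # ['NC', 'AL']
```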

In [85]:
#  - Find a way to join this table to the Additional Information table. We should maintain exactly 4,000 unique records. 

# Align datetime columns
student_df.date_of_birth=pd.to_datetime(student_df.date_of_birth,format='%d/%m/%Y')
lu_df.date_of_birth=pd.to_datetime(lu_df.date_of_birth,format='%Y-%m-%d')
# Merge tables
combined_df=student_df.merge(lu_df,on=['initials','date_of_birth','school_name','english','science','maths'],how='left')
combined_df.shape

(4000, 14)
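
The row count alone does not show that every student found a match. One hedged way to check this is to use the `validate` and `indicator` options of `DataFrame.merge`, sketched here on hypothetical miniature tables with only two of the join keys.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "initials": ["NC", "AL", "AL"],
    "date_of_birth": pd.to_datetime(
        ["01/02/2005", "03/04/2005", "05/06/2005"], format="%d/%m/%Y"
    ),
})
lookup = pd.DataFrame({
    "initials": ["NC", "AL", "AL"],
    "date_of_birth": pd.to_datetime(
        ["2005-02-01", "2005-04-03", "2005-06-05"], format="%Y-%m-%d"
    ),
    "region": ["EAST", "WEST", "EAST"],
})

merged = students.merge(
    lookup,
    on=["initials", "date_of_birth"],
    how="left",
    validate="one_to_one",  # raises MergeError if a key repeats on either side
    indicator=True,         # adds '_merge': 'both' for matched rows, 'left_only' otherwise
)

# No rows were duplicated and every student found a lookup row.
assert len(merged) == len(students)
assert (merged["_merge"] == "both").all()
```

On the real data the same two assertions would confirm that the join keeps 4,000 rows and leaves no student unmatched.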

In [86]:
#  - Develop a ranking system to rank each student by Grade Score within their specified Subject Selection and Region. Every combination of Subject Selection and Region should have their own ranking and remember that if students have a matching grade score, we then prioritise those who live closer to the school as a “tie-breaker”. 
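
One way to get a single rank with the distance tie-breaker built in is to sort and then number the rows within each group. This is a sketch on hypothetical data, not necessarily the approach the cells below settle on; the `rank` column name is an assumption.

```python
import pandas as pd

# Hypothetical applicants; students 1 and 2 tie on grade_score.
scores = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "subject_selection": ["Business Management"] * 5,
    "region": ["EAST", "EAST", "EAST", "WEST", "WEST"],
    "grade_score": [90, 90, 85, 70, 95],
    "distance_from_school_(miles)": [3.0, 1.5, 2.0, 4.0, 1.0],
})

group_cols = ["subject_selection", "region"]

# Highest score first; among equal scores, the shorter distance first.
ordered = scores.sort_values(
    by=group_cols + ["grade_score", "distance_from_school_(miles)"],
    ascending=[True, True, False, True],
)

# Row number within each subject/region group, starting at 1.
ordered["rank"] = ordered.groupby(group_cols).cumcount() + 1

print(ordered[["student_id", "region", "grade_score", "rank"]])
```

Because the sort order already encodes the tie-breaker, `cumcount` yields unique ranks without a separate distance rank column.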

In [87]:
# rank_df=combined_df[['student_id','grade_score','region','subject_selection','distance_from_school_(miles)']]
rank_df=combined_df.loc[:,[
    'subject_selection',
    'region',
    'grade_score',
    'student_id',
    'distance_from_school_(miles)'
]].copy() # explicit copy so rank columns can be added without a SettingWithCopyWarning


In [91]:
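# Rank grade_score within each subject/region; selecting the column first avoids ranking unrelated columns
rank_df['grade_score_rank']=rank_df.sort_values(by=['subject_selection','region','grade_score']).groupby(['subject_selection','region'])['grade_score'].rank(method='first',ascending=False)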

In [93]:
# rank_df.loc[rank_df.subject_selection=='Business Management'].sort_values(by=['distance_rank'])


rank_df.sort_values(by=['subject_selection','region','grade_score_rank','distance_from_school_(miles)']).groupby(['subject_selection','region','grade_score']).rank(method='first',ascending=True).merge(rank_df,left_index=True,right_index=True,suffixes=['_rank','']).to_csv('test.csv')

In [80]:
# Rank distance among students with the same grade score (closest student gets rank 1)
rank_df['distance_rank']=rank_df.sort_values(by=['subject_selection','region','grade_score_rank','distance_from_school_(miles)']).groupby(['subject_selection','region','grade_score'])['distance_from_school_(miles)'].rank(method='first',ascending=True)

In [None]:

# Sort Values
rank_df=rank_df.sort_values(by=['subject_selection','region','grade_score'])


In [None]:

# Create rank for grade score
rnk=rank_df.groupby(['subject_selection',
                     'region']).rank(method='first',ascending=False) # Using first to get unique ranks(Top 20)


In [None]:

# merge ranks with df
rank_df=rank_df.merge(rnk,left_index=True,right_index=True,suffixes=['','_rank']).drop(columns=['student_id_rank','distance_from_school_(miles)_rank'])

# Create dense rank for grade score (tied scores share a rank)
rnk=rank_df.groupby(['subject_selection',
                     'region'])['grade_score'].rank(method='dense',ascending=False)


In [None]:
# Add Column
rank_df['grade_score_rank_dense']=rnk

# # merge ranks with df

# rank_df=rank_df.merge(rnk,left_index=True,right_index=True,suffixes=['','_rank']).drop(columns=['student_id_rank','distance_from_school_(miles)_rank'])

# get rank for distance to school


In [None]:

rnk=rank_df.groupby(['subject_selection',
                     'region',
                     'grade_score_rank_dense'])['distance_from_school_(miles)'].rank(method='dense')

# merge ranks with df
rank_df=rank_df.merge(rnk,left_index=True,right_index=True,suffixes=['','_rank'])


In [None]:

# Get row number by region
rnk=rank_df.groupby(['region']).rank(method='first',ascending=False)['subject_selection']# method='first' gives each row a unique number

# Add Column
rank_df['region_rank']=rnk


In [None]:
rank_df

# merge ranks with df

# rank_df=
# rank_df.merge(rnk,left_index=True,right_index=True,suffixes=['','_rank'])

In [None]:
#  - For each Subject, find and flag the top 20 students with the caveat that this year within each course, 15 students must be from the East and 5 from the West given our newly imposed 75%/25% split
rank_df[(rank_df['region']=='EAST')&
        (rank_df['region_rank']<=25)].sort_values(by=['subject_selection','region','grade_score_rank','distance_from_school_(miles)_rank'])
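
The 15/5 split can be applied with a per-region quota lookup rather than one filter per region. This is a minimal sketch on hypothetical data: the `rank` and `accepted` column names are assumptions, and the label 'WEST' is assumed to mirror the 'EAST' label used in the filter above.

```python
import pandas as pd

# Places per region within each course, taken from the requirement text.
QUOTA = {"EAST": 15, "WEST": 5}

# Hypothetical ranked table for one subject: 20 EAST and 8 WEST applicants,
# with 'rank' already computed within subject and region.
ranked = pd.DataFrame({
    "subject_selection": ["Business Management"] * 28,
    "region": ["EAST"] * 20 + ["WEST"] * 8,
    "rank": list(range(1, 21)) + list(range(1, 9)),
})

# A student is accepted when their rank falls within their region's quota.
ranked["accepted"] = ranked["rank"] <= ranked["region"].map(QUOTA)

# Drop rejected students, leaving 20 per subject.
accepted = ranked[ranked["accepted"]]
print(accepted.groupby("region").size())
```

Because the rank is computed per subject and region, the same comparison handles every subject at once.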

In [None]:
rank_df.merge(rnk,left_index=True,right_index=True,suffixes=['','_rank']).to_csv('test.csv')

In [None]:
rank_df.to_csv('test.csv',index=False)

In [None]:
# Open the CSV in its default application (os.startfile is available on Windows only)
os.startfile('test.csv')