In [None]:
#importing the libraries 
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

In [None]:
#Importing the results data

df = pd.read_csv('./data/result.csv')

Let's see what the data looks like.

In [None]:
#Let's take a look at the data.

df.head()

Print the info() of the results table.

In [None]:
df.info()

## Data Cleaning

From the info table above we see that the centre column has null values, 21 to be exact. Let's look at which rows these are.

In [None]:
df[df['centre'].isna()]

These are all students whose results are reserved, that is, not released yet. We can't use their data for the analysis of status or sgpi, as that data doesn't exist. However, we can still use their gender and year of admission for generating insights.

While we are at it, let's look at the status column of the data.

In [None]:
df['status'].value_counts()

There are 497 'RR' values! That's a lot of reserved results. Let's take a closer look.

In [None]:
df[df['status'] == 'RR '].head(15)

Let's make a separate dataframe that contains only the students whose results have been declared, i.e. not reserved. This will be useful for analysis later.

In [None]:
declared_results = df[df['status'] != 'RR '].copy()

We can combine 'UM' and 'ABS' into one 'Unsuccessful' value, as they essentially mean the same thing.

In [None]:
declared_results['status'] = declared_results['status'].map(lambda x: "Unsuccessful " if (x in ['UM ','ABS ']) else x)

Let's make sure the values have been combined.

In [None]:
declared_results['status'].value_counts()
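The status labels carry a trailing space ('RR ', 'UM ', 'Successful '), which is why every comparison above has to include it. As an alternative, here is a minimal sketch on a made-up frame (the `sample` data is hypothetical, not taken from result.csv) that strips the whitespace first and then applies the same two cleaning steps. The rest of this notebook keeps the original labels with the trailing space.

```python
import pandas as pd

# Hypothetical sample; the real status column comes from result.csv.
sample = pd.DataFrame({"status": ["Successful ", "RR ", "UM ", "ABS ", "Successful "]})

# Strip surrounding whitespace so labels can be compared without the trailing space.
sample["status"] = sample["status"].str.strip()

# Drop reserved results, then fold 'UM' and 'ABS' into a single label.
declared = sample[sample["status"] != "RR"].copy()
declared["status"] = declared["status"].replace({"UM": "Unsuccessful", "ABS": "Unsuccessful"})

status_counts = declared["status"].value_counts()
print(status_counts)
```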

## Exploratory Data Analysis

A visualisation would really help us grasp these numbers, so let's make a pie chart.

In [None]:
declared_results['status'].value_counts().plot(kind='pie',autopct='%.1f%%')

So nearly 90% of the students whose results were declared passed and about 10% failed. The ratio of pass to fail is about 8.7.

Let's explore the data further.<br>
Let's find the oldest year of admission. Since this data is available for all students, we will use the whole table.

In [None]:
df['year_of_admission'].value_counts().sort_index()

In [None]:
df['year_of_admission'].value_counts().sort_index().plot(kind='bar')

The earliest admission was in 2014! This means that the student has been studying for 8 years! Out of curiosity, let's see if he managed to clear the exams this time.

In [None]:
df[df['year_of_admission'] == 2014] 

Successful, with a very decent sgpi of 6.86.<br>
Let's look at how the students are distributed geographically.

In [None]:
df['centre'].value_counts()

The result contains colleges from 17 different locations (talukas).<br>
Interestingly, Mumbai University has more students from Thane than from Mumbai, at least according to the data released.

Let's see how many different colleges under MU appear in the released result.

In [None]:
len(df['clg_id'].unique())

There are 53 different colleges. Let's see which college has the most students and where it is.

In [None]:
df.groupby(['clg_id','centre'])['prn'].count().sort_values(ascending=False)

233 students in one college! A little too many, I think. It is closely followed by college 237 with 227 students. Unfortunately we don't have data mapping ids to college names, so we don't know which colleges these are.

Next, let's see what percentage of students are male and female.

In [None]:
df['gender'].value_counts().plot(kind='pie',autopct='%.1f%%')

About 70% of students are male. This is in Computer Science Engineering, one of the few engineering fields with a higher ratio of females.
The ratio of male to female students is 70.7/29.3 = **2.41**

Finally, let's have a look at the sgpi distribution. We can quickly see the stats using the *describe* method.

In [None]:
declared_results['sgpi'].describe()

We see that the min sgpi is 0.0, which belongs to the students that have failed. Since every sgpi under 4.0 is recorded as 0.0, we will ignore these values, as they would pull the center of the distribution towards 0.0.

In [None]:
declared_results.query("sgpi >= 4.0")['sgpi'].describe()

When we look at the sgpi distribution of students that have passed, i.e. have an sgpi of at least 4.0, we see that the min sgpi is 5.45. This is very interesting, as it means that out of 4227 students not a single one scored below 5.45, or in other words "just passed". <br>
Let's plot this distribution.

In [None]:
declared_results[declared_results['sgpi'] > 0]['sgpi'].plot(kind='density',grid=True,xlim=(4,10),xticks=np.arange(4.0,10.5,0.5))

At first glance the distribution looks left skewed, as most of the values are towards the right. However, if we look at the mean and the median from the describe method (7.87 and 7.86 respectively), we see that they lie at nearly the same point. This suggests the distribution is roughly symmetric rather than skewed, and that it is reasonably well approximated by a normal distribution with mean about 7.87.<br>
The excess kurtosis, judging only by the graph, seems to be near 0, which is consistent with an approximately normal distribution.
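Rather than judging skewness and kurtosis by eye, both can be computed directly. The sketch below uses a synthetic sample as a stand-in for the passed students' sgpi (the mean 7.87 and standard deviation 0.79 are assumptions chosen to resemble the figures in this notebook, not the real data). `Series.skew()` and `Series.kurt()` are standard pandas methods; `kurt()` reports excess kurtosis, so a value near 0 matches a normal distribution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the passed students' sgpi (assumed mean 7.87, sd 0.79).
rng = np.random.default_rng(0)
sgpi = pd.Series(rng.normal(loc=7.87, scale=0.79, size=4000))

skewness = sgpi.skew()          # 0 for a perfectly symmetric distribution
excess_kurtosis = sgpi.kurt()   # excess kurtosis; 0 for a normal distribution

print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```

On the real data the same two calls would go on `declared_results.query("sgpi >= 4.0")['sgpi']`.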

The z-score for a two-sided 95% interval is 1.96, so a value must lie more than 1.96 standard deviations above or below the mean to fall outside it.<br>
For our data that means a value must be above 9.42 or below 6.32 to be considered significantly different.<br>
*Basically, we expect about 95% of passed students to have an sgpi between 6.32 and 9.42. This range is built from the spread of individual students, so it is a very conservative bar for a college mean: if any college has a mean sgpi outside it, we can confidently say that it is significantly different.* Essentially any college with mean sgpi above 9.42 is significantly good and any college with mean sgpi below 6.32 is significantly bad.<br>
We will use this later to see if there are any exceptional colleges in the data.
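The arithmetic behind the 6.32 and 9.42 bounds can be sketched as follows. The standard deviation of 0.79 is an assumption, back-calculated from the two bounds since it is not quoted in the text. For comparison, the sketch also computes the narrower band that applies to the mean of a hypothetical college with 30 students, which uses the standard error instead of the standard deviation.

```python
import math

mean_sgpi = 7.87   # mean of passed students, from describe() above
sd_sgpi = 0.79     # assumption: inferred from the 6.32 and 9.42 bounds
z = 1.96           # two-sided 95% z-score

# Range expected to contain about 95% of individual students.
lower = mean_sgpi - z * sd_sgpi
upper = mean_sgpi + z * sd_sgpi
print(round(lower, 2), round(upper, 2))

# Band for the mean of a hypothetical college with n students:
# it uses the standard error sd / sqrt(n), so it is much narrower.
n = 30
se = sd_sgpi / math.sqrt(n)
mean_lower, mean_upper = mean_sgpi - z * se, mean_sgpi + z * se
print(round(mean_lower, 2), round(mean_upper, 2))
```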

Let's answer some questions.

## Q1. How many male and female students failed the semester? Is the ratio the same as for passed students?

From our EDA above we see that about 70% of students are male and 30% female. Let's see how many of them failed.

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2,figsize = (8,10))
ax1.pie(declared_results.query("status == 'Unsuccessful '").groupby('gender')['gender'].count(), labels = ['Female','Male'], autopct = '%.1f%%')
ax1.set_title('Percent of Male/Female Failed')
ax2.pie(declared_results.query("status == 'Successful '").groupby('gender')['gender'].count(), labels = ['Female','Male'], autopct = '%.1f%%')
ax2.set_title('Percent of Male/Female Passed')

There is some difference in the male/female split between passed and failed students. Female students make up a larger share of the passed group than of the failed group. Now that we have seen how they compare against each other, let's see how they compare with themselves.

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2,figsize = (8,10))
ax1.pie(declared_results.query("gender == 'M'").groupby('status')['status'].count(), labels = ['Passed','Failed'], autopct = '%.1f%%')
ax1.set_title('Percent of Male Passed/Failed')
ax2.pie(declared_results.query("gender == 'F'").groupby('status')['status'].count(), labels = ['Passed','Failed'], autopct = '%.1f%%')
ax2.set_title('Percent of Female Passed/Failed')

Let's try a different visualisation to make it clearer.

In [None]:
declared_results.groupby(['gender','status'],as_index=True).size().unstack().plot(kind='bar',stacked=True)

We see that female students did better than male students this semester: more males failed in absolute terms, and a higher percentage of males failed than of females.
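The per-gender pass and fail percentages can also be read off as a table instead of from pie charts. A minimal sketch with `pd.crosstab`, using a small made-up frame (`toy` is hypothetical; the real records live in `declared_results`):

```python
import pandas as pd

# Hypothetical records; the real ones live in declared_results.
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "status": ["Successful ", "Successful ", "Unsuccessful ", "Unsuccessful ",
               "Successful ", "Successful ", "Successful ", "Unsuccessful "],
})

# normalize="index" turns each row (gender) into fractions that sum to 1.
rates = pd.crosstab(toy["gender"], toy["status"], normalize="index")
print(rates)
```

Passing `normalize="columns"` instead would give the gender split within each status, which is the first pair of pie charts above.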

## Q2. Is the ratio of students that passed/failed approximately the same across all colleges?

To answer this we will group by the clg_id and find the count of successful and unsuccessful students. Then we will unstack it to make it easier to manipulate.

In [None]:
declared_results.groupby(['clg_id','status']).size().unstack()

There are a few NaN values in there. Let's look at the data for one of these colleges.

In [None]:
declared_results[declared_results['clg_id'] == 807]

There is only one student from college 807 whose result was declared, and that result was unsuccessful. This shows a crucial flaw in our approach. We can't look at the ratio for every college, since many of them may have very few students whose results are declared. Such a small sample can't be useful. So let's only look at the colleges which have more than 30 students whose results are declared.


In [None]:
clg_id = declared_results.groupby('clg_id',as_index=False).size().query("size > 30")['clg_id']

Now that we have the clg_ids of colleges with more than 30 declared results, we will use them to filter our answer.

We will store the table with successful and unsuccessful students of each college in a temporary variable for the scope of this question, filter it with the clg_ids from above, and then calculate a new *ratio* column to see the ratio of successful to unsuccessful students.

In [None]:
q2_df = declared_results.groupby(['clg_id','status']).size().unstack()
q2_df = q2_df.loc[q2_df.index.isin(clg_id)]

q2_df['ratio'] = q2_df['Successful ']/q2_df['Unsuccessful ']
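One detail worth making explicit is what happens to colleges with no failed students: after `unstack()` their 'Unsuccessful ' count is NaN, so the ratio is NaN as well. The sketch below, on a hypothetical toy frame, fills missing counts with 0, applies a size filter (3 here only because the toy data is tiny; the notebook uses 30) and keeps the ratio as NaN where it is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical data: college 1 has failures, college 2 has none, college 3 is tiny.
toy = pd.DataFrame({
    "clg_id": [1] * 5 + [2] * 4 + [3],
    "status": ["Successful "] * 4 + ["Unsuccessful "]
              + ["Successful "] * 4 + ["Unsuccessful "],
})

counts = toy.groupby(["clg_id", "status"]).size().unstack(fill_value=0)

# Keep colleges above a minimum size (3 for the toy data).
counts = counts[counts.sum(axis=1) > 3].copy()

# A college with zero failures has no finite pass/fail ratio, so mark it as NaN.
counts["ratio"] = counts["Successful "] / counts["Unsuccessful "].replace(0, np.nan)
print(counts)
```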

In [None]:
q2_df.describe()

From our EDA above we know that the overall ratio of pass to fail is 8.7. <br>
From the summary stats above we see that the median ratio lies near that value, but the mean is far higher, meaning that the distribution of ratios is skewed.

Let's see all the colleges and their pass/fail ratios.

In [None]:
q2_df.sort_values(by='ratio')

This is very interesting. Colleges with more students have proportionally fewer failed students (like colleges 428 and 890), while colleges with fewer students have proportionally more failed students (like colleges 948 and 889). <br>
Normally we would expect colleges with more students to have a larger number of failed students, because having more students raises the probability of having students that will fail.<br>
One possible reason for this trend could be that the colleges with more students are actually prestigious colleges, and hence can afford a high intake capacity, while the colleges with fewer students are not that great. This would also mean that students from these prestigious colleges were exposed to better teachers and more practice, decreasing the percentage of students that fail, while the opposite happened in small colleges.<br>
This also suggests that the majority of students that failed come from these smaller colleges.


## Q3. Which colleges have the most high scoring students?

First, let's look at how many students scored a 10.0 sgpi.

In [None]:
declared_results.query("sgpi == 10.0")

Only two, both female.<br>
Since that is too few, for our analysis we will consider students who scored between 9.0 and 10.0.

Let's see how many students scored in that range.

In [None]:
declared_results.query("sgpi >= 9.0")

369 students are enough to do some analysis. 

In [None]:
q3_df = declared_results.query("sgpi >= 9.0").copy()

q3_df.groupby('clg_id').size().sort_values(ascending = False).head()

College 366 has the highest number of high scorers, a whopping 51 of them.<br>
But it's possible that college 366 simply has a lot of students, in which case 51 is not that high. To compare colleges objectively, we have to divide the number of high scorers by each college's total number of students.

In [None]:
clg_size = declared_results.groupby('clg_id').size()

In [None]:
q3_df_grouped = q3_df.groupby('clg_id',as_index=False).size()

In [None]:
def q3_ratio(row):
    return row['size']/clg_size.loc[row['clg_id']]

In [None]:
q3_df_grouped['topper_ratio'] = q3_df_grouped.apply(q3_ratio,axis=1)

In [None]:
q3_df_grouped.sort_values(by='topper_ratio',ascending = False).head().rename(columns = {'size':'toppers'}) #renamed the size column to toppers to make it more readable
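The same topper ratio can be computed without a row-wise `apply`, because pandas aligns two Series on their index when dividing. A minimal sketch on hypothetical data (college ids and scores are made up):

```python
import pandas as pd

# Hypothetical per-student data for three colleges.
toy = pd.DataFrame({
    "clg_id": [10, 10, 10, 10, 20, 20, 20, 30],
    "sgpi":   [9.5, 9.1, 7.0, 6.5, 9.2, 8.0, 7.5, 6.0],
})

clg_size = toy.groupby("clg_id").size()
toppers = toy.query("sgpi >= 9.0").groupby("clg_id").size()

# Division aligns on the clg_id index; colleges with no toppers get a ratio of 0.
topper_ratio = toppers.div(clg_size, fill_value=0).sort_values(ascending=False)
print(topper_ratio)
```

Unlike the grouped table above, this version also keeps colleges that have no toppers at all, with a ratio of 0.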

Now we see that college 366 is still at the top, with 25% of its students scoring 9.0 sgpi or above. However, college 426, which wasn't in the top list above, now appears, as 19.5% of its students scored 9.0 sgpi or above. This is a fairer comparison than what we tried before.<br>
And if we look at the number of students here (from the answer to the previous question), we see that it has 100+ students, further supporting our hypothesis that colleges with more students are prestigious.

## Q4. Are there any colleges that have performed significantly better than other colleges?

To answer this question we can use the 9.42 sgpi threshold we determined earlier: if any college has a mean sgpi of passed students greater than 9.42, we can confidently say that it is a better college. <br>
Again we will use colleges where more than 30 students had their results declared. We will also only take the mean over students that passed, since the sgpi of failed students is 0.0 and would pull the mean a lot lower.

In [None]:
q4_df = declared_results[declared_results['status'] == 'Successful '].groupby('clg_id')['sgpi'].mean().copy()

In [None]:
q4_df = q4_df.loc[q4_df.index.isin(clg_id)]

In [None]:
q4_df[q4_df.values > 9.42]

NONE! There is not a single exceptionally good college. To be fair, this is to be expected. But let's see if there is a bad college (i.e. one with a mean sgpi less than 6.32).

In [None]:
q4_df[q4_df.values < 6.32]

None again. It seems there are no exceptionally good or bad colleges.<br>
Let's make a visualisation to show the distribution of sgpi for each college (passed students only).

In [None]:
declared_results[declared_results['status'] == 'Successful '].boxplot(column='sgpi',by='clg_id',rot=90,figsize=(15,10))
plt.suptitle('')
plt.title("sgpi of each college(of successful students)")
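The threshold check above compares college means against a range built from individual students. A possible follow-up, sketched here on hypothetical data, is to compute each college's mean and size in one pass with `agg` and then express the mean as a z-score using the standard error (overall standard deviation divided by the square root of the college's size). This is an illustration of the idea, not a result from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical passed-student records for two colleges.
toy = pd.DataFrame({
    "clg_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "sgpi":   [7.0, 7.5, 8.0, 8.5, 8.8, 9.0, 9.2, 9.4],
})

overall_mean = toy["sgpi"].mean()
overall_sd = toy["sgpi"].std()

# Mean and number of students per college in a single groupby.
summary = toy.groupby("clg_id")["sgpi"].agg(["mean", "count"])

# z-score of each college mean, using the standard error sd / sqrt(n).
summary["z"] = (summary["mean"] - overall_mean) / (overall_sd / np.sqrt(summary["count"]))
print(summary)
```

A college whose `z` lies beyond ±1.96 would then be flagged as differing from the overall mean.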

## Q5. Which elective subject was selected by most students?

For this we will have to use the *subject.csv* table. Let's start by importing it.

In [None]:
subjects = pd.read_csv("./data/subject.csv")
subjects.head()

In [None]:
#Importing subject-subjectcode.csv to rename codes with names
subject_name = pd.read_csv('./data/subject-subjectcode.csv',index_col='subject_code')

In [None]:
subject_name.info()

In [None]:
pd.concat([subjects['paper_6'].value_counts(),subjects['paper_7'].value_counts(),subjects['paper_8'].value_counts()],ignore_index=False)

In [None]:
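# Name the combined series 'count' so the column label is the same on every pandas version
subject_count = pd.concat([subjects['paper_6'].value_counts(),subjects['paper_7'].value_counts(),subjects['paper_8'].value_counts()]).rename('count')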
subject_count.index = subject_count.index.astype(str)
subject_count_named = pd.concat([subject_count,subject_name],axis=1,join='inner')

In [None]:
subject_count_named.rename({0:'count'},axis=1,inplace=True)
subject_count_named

In [None]:
subject_count_named.plot(kind='bar',x='subject_name',y='count')

Since 3 sorted series were concatenated, we can clearly see the 3 groups in the graph as well. When the next bar is taller than the current bar, the current group ends there.

Overall, Natural Language Processing was the most selected subject. The most selected subjects for paper 6, paper 7 and paper 8 are Natural Language Processing, Blockchain, and Cyber Security and Law respectively.<br>
Interestingly, these are also the top 3 most selected subjects overall, in that exact order.
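To keep track of which paper slot each count belongs to, instead of inferring the groups from the bar heights, the three paper columns can be reshaped into long form with `melt` and counted per paper. A minimal sketch with hypothetical codes and names (`toy_subjects` and `toy_names` are made up for illustration):

```python
import pandas as pd

# Hypothetical elective choices and code-to-name lookup.
toy_subjects = pd.DataFrame({
    "paper_6": ["A1", "A1", "A2"],
    "paper_7": ["B1", "B1", "B1"],
    "paper_8": ["C1", "C2", "C2"],
})
toy_names = pd.Series({"A1": "Elective A1", "A2": "Elective A2", "B1": "Elective B1",
                       "C1": "Elective C1", "C2": "Elective C2"})

# melt() stacks the three paper columns into one 'subject_code' column tagged by 'paper'.
long = toy_subjects.melt(var_name="paper", value_name="subject_code")
long["subject_name"] = long["subject_code"].map(toy_names)

counts = long.groupby(["paper", "subject_name"]).size().sort_values(ascending=False)
print(counts)
```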

## Q6. What are the highest total marks scored? And what are the highest marks scored in each subject?

Let's start by importing the marks table.

In [None]:
marks = pd.read_csv("./data/marks.csv")
marks.head()

In [None]:
#marks[marks.notna().all(axis=1)]

In [None]:
marks['total'] = marks['paper_1_TOT'] + marks['paper_2_TOT'] + marks['paper_3_TOT'] + marks['paper_4_TOT'] +marks['paper_5_TOT']+marks['paper_6_TOT']+marks['paper_7_TOT']+marks['paper_8_TOT']+marks['paper_9_TOT']+marks['paper_10_TOT']
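The ten-term sum above can be written more compactly by selecting the `paper_<n>_TOT` columns with a regular expression. A minimal sketch on a hypothetical two-paper table; `skipna=False` is used so that a student with a missing paper gets a NaN total, which matches what chained `+` does.

```python
import numpy as np
import pandas as pd

# Hypothetical marks table with two papers; the real one has paper_1_TOT ... paper_10_TOT.
toy_marks = pd.DataFrame({
    "seat_no": [1, 2, 3],
    "paper_1_TOT": [80.0, 70.0, np.nan],
    "paper_2_TOT": [90.0, 60.0, 75.0],
})

# Select every column named paper_<n>_TOT, then sum across columns.
# skipna=False keeps NaN totals, matching the behaviour of chained '+'.
tot_cols = toy_marks.filter(regex=r"^paper_\d+_TOT$").columns
toy_marks["total"] = toy_marks[tot_cols].sum(axis=1, skipna=False)
print(toy_marks)
```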

In [None]:
marks[['seat_no','total']].sort_values(ascending = False,by='total',ignore_index=True).head()

The top two scorers are tied at 639 marks each. These are the same two girls that we saw in an earlier question, where we looked at the top scorers by sgpi.

If you look at the 3rd highest value, you'll notice that this student has only one mark less than the toppers. Yet, as we know, there are only two students with a 10 sgpi. Thus the 3rd highest student didn't get a 10 pointer. Let's confirm this.

In [None]:
declared_results.query("seat_no == 5016743")

She got 9.86 instead of a 10 because of 1 mark!

Let's look at the sgpi of the 4th highest scorer.

In [None]:
declared_results.query("seat_no == 5018497")

9.86! It's the same. Even though the 4th ranker scored a whole 5 marks less than the 3rd ranker, she still has the same sgpi, while the 3rd ranker has a lower sgpi than the toppers despite being only one mark behind.

Let's see the highest marks for each subject.

In [None]:
cols = ['paper_1_TOT','paper_2_TOT','paper_3_TOT','paper_4_TOT','paper_5_TOT','paper_6_TOT','paper_7_TOT','paper_8_TOT','paper_9_TOT','paper_10_TOT']
marks[cols].max(axis=0)

Out of the 5 written papers, paper_8 had the highest marks scored. Let's see what subject it was. First we find the student who scored 98, then we select that student from the subjects table.

In [None]:
marks.query("paper_8_TOT == 98.0")

In [None]:
subjects[subjects['seat_no'] == marks.query("paper_8_TOT == 98.0")['seat_no'].values[0]]

Now we look up the subject with code 42181.

In [None]:
subject_name.query("subject_code == '42181'")

Management Information System, makes sense.

## Summary:

* The data covers 5211 students, of which 497 students' results haven't been declared yet.
* Of the students whose results were declared, 89.7% passed and 10.3% failed.
* The students took admission in various years, the latest being 2022 and the oldest 2014.
* The students are from 17 different locations (talukas). Most of the students are from the Thane and Mumbai talukas, with 1841 and 1338 students respectively.
* There are 53 colleges whose data is contained in the released result.
* Out of all the students, 70.7% are male and 29.3% are female.
* Of the students who failed, 81.3% were male and 18.7% female; of the students who passed, 69.3% were male and 30.7% female.
* Looking at each gender individually, 88.1% of males passed while 93.4% of females passed.
* The minimum SGPI among passed students is 5.45, and the mean SGPI is 7.87.
* The distribution of SGPI is approximately normal.
* Not all colleges have the same pass/fail ratio as the total dataset.
* Colleges with more students have proportionally fewer failed students (like colleges 428 and 890), while colleges with fewer students have proportionally more failed students (like colleges 948 and 889).
* One possible reason for this trend could be that the colleges with more students are actually prestigious colleges, and hence can afford a high intake capacity, while the colleges with fewer students are not that great. This would also mean that students from these prestigious colleges were exposed to better teachers and more practice, decreasing the percentage of students that fail, while the opposite happened in small colleges.
* This also suggests that the majority of students that failed come from these smaller colleges.
* College 366 is the highest performing college, with 51 students having an sgpi of 9.0 or above. It also has the highest ratio of toppers to total students, with 25% of its students scoring 9.0 or above.
* There are no colleges that have performed significantly better or worse than the average.
* Natural Language Processing was the most selected subject. The most selected subjects for paper 6, paper 7 and paper 8 are Natural Language Processing, Blockchain, and Cyber Security and Law respectively, which are also the top 3 most selected subjects overall, in that exact order.
* The highest total marks scored is 639, and the highest marks scored in any subject is 98 (in Management Information System).
* There is a quirk in the scoring system due to which a student who scored just one mark less than the toppers (i.e. 638 total marks) didn't get a 10.0 sgpi but instead got 9.86, while another student who scored 5 marks less than this student still got the same sgpi of 9.86.