### Note:
This dataset contains 33 features describing students at two schools. We want to explore how each feature relates to the students' grades.

### Library Imports

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# 1_Data Acquisition
###  * Load Data for Analysis

In [None]:
df = pd.read_csv('../input/student-performance-data/student_data.csv')




### Dataset Description

Attributes of the student-por.csv (Portuguese language course) dataset:

* school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)                             
* sex - student's sex (binary: 'F' - female or 'M' - male)                                                             
* age - student's age (numeric: from 15 to 22)
* address - student's home address type (binary: 'U' - urban or 'R' - rural)
* famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
* Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
* Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
* Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
* reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
* guardian - student's guardian (nominal: 'mother', 'father' or 'other')
* traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
* studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
* failures - number of past class failures (numeric: n if 1<=n<3, else 4)
* schoolsup - extra educational support (binary: yes or no)
* famsup - family educational support (binary: yes or no)
* paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
* activities - extra-curricular activities (binary: yes or no)
* nursery - attended nursery school (binary: yes or no)
* higher - wants to take higher education (binary: yes or no)
* internet - Internet access at home (binary: yes or no)
* romantic - with a romantic relationship (binary: yes or no)
* famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
* freetime - free time after school (numeric: from 1 - very low to 5 - very high)
* goout - going out with friends (numeric: from 1 - very low to 5 - very high)
* Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
* Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
* health - current health status (numeric: from 1 - very bad to 5 - very good)
* absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Portuguese:

* G1 - first period grade (numeric: from 0 to 20)
* G2 - second period grade (numeric: from 0 to 20)
* G3 - final grade (numeric: from 0 to 20, output target)

 

In [None]:
#show all columns
pd.options.display.max_columns=None
df.info()

## Understanding the Data

In [None]:
#display first 5 rows of dataset
df.head()

In [None]:
#displays last 5 rows of dataset
df.tail()

In [None]:
df.shape

In [None]:
#list all column names
df.columns

In [None]:
#display the unique values of each column
for i in df.columns:
    print(i.ljust(15),df[i].unique())
    

In [None]:
#number of unique values 
df.nunique()

In [None]:
for i in df.select_dtypes(include='object'):
    print(df[i].value_counts())

### Add a grade-average column for each student

In [None]:
df["grades average"]=(df.G1+df.G2+df.G3)/3
df.head()
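
The same average can also be computed row-wise with `DataFrame.mean(axis=1)`, which avoids repeating the column names in an arithmetic expression. A minimal sketch on a small made-up frame (the `demo` values are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical grades for three students (illustrative values only).
demo = pd.DataFrame({"G1": [10, 15, 8], "G2": [12, 14, 9], "G3": [11, 16, 10]})

# Row-wise mean of the three period grades.
demo["grades average"] = demo[["G1", "G2", "G3"]].mean(axis=1)
print(demo)
```

Both forms should give the same numbers when no grade is missing; with missing grades, `mean(axis=1)` skips NaN by default, whereas the arithmetic sum propagates it.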


# 2_Data Cleansing

### Check for Missing Values

In [None]:
df.isnull().sum()

### * There are no missing values in this dataset 
## Checking for Outliers

In [None]:
plt.figure(figsize=(15,8))
df.boxplot(color='b',sym='r+')


In [None]:
# Quartiles of the numeric columns (excluding 'failures') for the IQR rule
numeric = df.select_dtypes(include='number').drop(columns='failures')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR= Q3-Q1
print(IQR)
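
To make the IQR rule concrete, here is a minimal sketch on a single made-up column. The `absences` values below are hypothetical; the 1.5 multiplier is the usual Tukey convention, which is also what the bounds in this notebook use.

```python
import pandas as pd

# Hypothetical absence counts; the value 40 is deliberately extreme.
absences = pd.Series([0, 2, 4, 4, 6, 8, 10, 40])

q1 = absences.quantile(0.25)
q3 = absences.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: values further than 1.5 * IQR from the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = absences[(absences < lower) | (absences > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("bounds:", lower, upper)
print("outliers:", outliers.tolist())
```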

# 3_Data Exploring

In [None]:
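lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Mask outlier cells (set them to NaN) in the numeric columns only; other columns are kept as-is
cols = IQR.index
df2 = df.copy()
df2[cols] = df[cols].mask((df[cols] < lower_bound) | (df[cols] > upper_bound))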

plt.figure(figsize=(15,8))
df2.boxplot(color='g',sym='rx')
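
Note that masking a DataFrame with a boolean DataFrame, as is done for `df2`, does not drop rows: the shape is kept and the flagged cells become NaN. The sketch below contrasts this with dropping whole rows, using a made-up `demo` frame and an arbitrary threshold of 20 chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: one extreme absence value.
demo = pd.DataFrame({"absences": [2, 4, 50], "G3": [10, 12, 11]})

# Boolean frame, True only where a cell exceeds the illustrative threshold.
mask = demo > 20

# Cell-wise masking: same shape, flagged cells become NaN.
masked = demo[~mask]

# Row-wise filtering: any row containing a flagged cell is removed.
dropped = demo[~mask.any(axis=1)]

print(masked)
print(dropped)
```

Cell-wise masking keeps the other values of a student with one extreme entry, at the cost of NaN cells that later plots and statistics silently skip.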

In [None]:
sns.boxplot(x='absences', data=df2)


In [None]:
df3=df2[df2['absences']<16]
sns.boxplot(x='absences', data=df3)
# I learned this from Irene A. Gyebi

In [None]:
plt.figure(figsize=(15,8))
df3.boxplot(color='b')

# 4_Data Analysis
## Data Statistics

In [None]:
df.describe().T

## Visualizing Data Correlation

In [None]:
plt.figure(figsize=(20,20))
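# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df2.select_dtypes(include='number').corr(), vmin=-1, cmap="plasma_r", annot=True)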
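
A large heatmap can be hard to read when only the relationship with the final grade matters. One option is to take the `G3` column of the correlation matrix and sort it by absolute value. A minimal sketch, assuming a made-up `demo` frame with hypothetical values:

```python
import pandas as pd

# Hypothetical numeric columns (illustrative values only).
demo = pd.DataFrame({
    "studytime": [1, 2, 2, 3, 4],
    "failures":  [3, 1, 1, 0, 0],
    "G3":        [6, 10, 11, 14, 17],
})

# Pearson correlation of every other column with G3, strongest first.
corr_with_g3 = (
    demo.corr()["G3"]
    .drop("G3")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_g3)
```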


# 5_Data Visualization

In [None]:
for i in df.columns:
    df[i].value_counts().plot(kind='bar',color='b')
    plt.title(i)
    plt.show()
    
#Regression

#for i in df2.select_dtypes(include='number'):
#    sns.regplot(x=i,y='G3', data=df2)
#    plt.show()

In [None]:

plt.figure(figsize=(6,5))
sns.barplot(x= 'sex', y = 'grades average', data = df2, errwidth=1,saturation=1, palette='Blues_d') 
plt.title('sex vs grades average \n')
plt.show()

b = sns.swarmplot(x='age', y='G3',hue='sex', data=df2)
b.axes.set_title('Does age affect final grade?\n', fontsize = 20)
b.set_xlabel('Age', fontsize = 20)
b.set_ylabel('Final Grade', fontsize = 20)
plt.show()

## Q_1
   #### Does gender affect the average score?
   #### Answer: No.
## Q_2
   #### Does age affect final grade?
   #### Answer: There seems to be no clear relationship between age or gender and the final grade.
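
Small differences between bars are hard to judge by eye, so a numeric summary per group can back up the plot. A minimal sketch on made-up values (the `demo` frame is hypothetical; in the notebook the same call would be applied to `df2`):

```python
import pandas as pd

# Hypothetical average grades for six students.
demo = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "grades average": [11.0, 12.0, 13.0, 10.0, 12.0, 14.0],
})

# Group size, mean, and spread of the average grade for each sex.
summary = demo.groupby("sex")["grades average"].agg(["count", "mean", "std"])
print(summary)
```

In this toy example both groups have the same mean and differ only in spread, which is the kind of detail a bar plot of means does not show.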

In [None]:
plt.figure(figsize=(6,6))
apart = (df2['Pstatus']=='A').sum()      # students whose parents live apart
together = (df2['Pstatus']=='T').sum()   # students whose parents live together
plt.style.use('ggplot')
plt.pie([apart,together], labels=['A','T'],explode=[0.1,0.1],startangle=0,labeldistance=1.2,autopct='%.2f %%')
plt.title('Pstatus\n\n A  vs  T ')
plt.show()

plt.figure(figsize=(6,6))
sns.barplot(x= 'Pstatus', y = 'grades average', data = df2, errwidth=1,saturation=1, palette='Blues_d') 
plt.title('Pstatus vs grades average')
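
The slice sizes of a pie chart like the one above can also be obtained in one step with `value_counts`, which returns either raw counts or shares. A minimal sketch on a hypothetical `pstatus` series:

```python
import pandas as pd

# Hypothetical cohabitation statuses for four students.
pstatus = pd.Series(["T", "T", "T", "A"], name="Pstatus")

# Raw counts per category, and the same as percentages.
counts = pstatus.value_counts()
shares = pstatus.value_counts(normalize=True) * 100

print(counts)
print(shares)
```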

## Q_3
   #### Does the parents' cohabitation status (binary: "T" - living together or "A" - apart) affect the student's grades?
   #### Answer: No.

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x= 'age', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d') 
plt.title('age vs grades average')

## Q_4 
   #### Does older age affect the average grade?
   #### Answer: Average grades decrease with age, although they rise again for students in their twenties.

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(x='traveltime', y='grades average', data=df3, color='g')
plt.show()

## Q_5
   #### Does travel time affect student grades?
   #### Answer: Yes, the shorter the travel time, the higher the grades.

In [None]:
plt.figure(figsize=(10,8))
colors = sns.color_palette('bright')[0:5]
df.Mjob.value_counts().plot(kind='pie', autopct='%.0f%%', colors = colors)
plt.title('Mjob proportion in %')
plt.legend()

In [None]:
plt.figure(figsize= (15,5))
plt.subplot(1,2,1)
order_by = df.groupby('Fedu')['G1'].median().sort_values(ascending = False).index
sns.boxplot(x = df['Fedu'], y = df['G1'],order = order_by)
plt.xticks(rotation = 90)
plt.title('Fedu v/s G1')

plt.subplot(1,2,2)
order_by = df.groupby('Medu')['G1'].median().sort_values(ascending = False).index
sns.boxplot(x = df['Medu'], y = df['G1'],order = order_by)
plt.xticks(rotation = 90)
plt.title('Medu v/s G1')

plt.show()

## Q_6
   #### Does the education of the father and mother affect the student's grades?
   #### Answer: Yes, the higher the parents' education level, the higher the student's grades.
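
Because `Medu` and `Fedu` are ordinal codes rather than true measurements, a rank-based (Spearman) correlation is one reasonable way to quantify this trend. The sketch below computes it as the Pearson correlation of the ranks, on a made-up `demo` frame with hypothetical values:

```python
import pandas as pd

# Hypothetical mother's education codes and first-period grades.
demo = pd.DataFrame({
    "Medu": [0, 1, 2, 3, 4, 4],
    "G1":   [7, 9, 10, 12, 13, 15],
})

# Spearman correlation equals the Pearson correlation of the ranks
# (tied values receive their average rank).
rho = demo["Medu"].rank().corr(demo["G1"].rank())
print(round(rho, 3))
```

A value close to 1 indicates that grades tend to rise with the education code; it does not by itself show that one causes the other.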

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x= 'failures', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')
plt.figure(figsize=(8,5))
sns.barplot(x= 'studytime', y = 'failures', data = df2, errwidth=3,saturation=1, palette='Blues_d')

In [None]:
sns.regplot(x='studytime', y='grades average', data=df2)

## conclusion   
   ####  *  Another batch of expected results: students who study more score better on tests and quizzes, and their number of failures decreases.
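
`sns.regplot` draws the fitted line but does not report its slope. To put a number on the trend, a straight line can be fitted with `np.polyfit`. A minimal sketch, assuming made-up and perfectly linear values chosen only so the result is easy to check:

```python
import numpy as np

# Hypothetical study-time codes and average grades (perfectly linear on purpose).
studytime = np.array([1.0, 2.0, 3.0, 4.0])
grades = np.array([9.0, 11.0, 13.0, 15.0])

# Least-squares fit of grades = slope * studytime + intercept.
slope, intercept = np.polyfit(studytime, grades, deg=1)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

On real data the slope would be the estimated change in average grade per study-time category, and rows with NaN would need to be dropped before fitting.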

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x= 'famrel', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')
plt.figure(figsize=(8,5))
sns.barplot(x= 'higher', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')

plt.figure(figsize=(8,5))
sns.barplot(x= 'health', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')

## conclusion   
   ####  *  Students who want to pursue higher education have a higher average score in exams.
## Q_7
   ####  Does the quality of family relationships affect the student's grades?
   #### Answer: Yes, the higher the quality of family relationships, the higher the student's grades.
## Q_8
   #### Does the current health status affect the student's grades?
   #### Answer: No, it does not have a strong effect.

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x= 'G1', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')
plt.figure(figsize=(8,5))
sns.barplot(x= 'G2', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')

plt.figure(figsize=(8,5))
sns.barplot(x= 'G3', y = 'grades average', data = df2, errwidth=3,saturation=1, palette='Blues_d')

## conclusion   
   ####  *  The higher the student's first-period, second-period, and final grades, the higher the student's average score.
   ####  *  The higher the student's first- and second-period grades, the higher their final grade.

In [None]:
print("\nNumber of students in the two schools.\n")
print(df['school'].value_counts())
sns.pairplot(
    df2,
    x_vars=["G1", "G2"],
    y_vars=["G1", "G2", "G3"],
    hue="school"
)

## conclusion
#### * In all exams, the scores of students at school GP are higher than those of students at school MS. This is not sufficient evidence that MS students perform worse, because the groups differ greatly in size: 349 students at GP versus 46 at MS.
#### * We also note that most students in both schools have high grades.
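
One way to account for the unequal group sizes is to report the standard error of the mean next to each school's average, since a smaller group gives a less certain estimate. A minimal sketch on a made-up `demo` frame (hypothetical values, with 8 GP students and 2 MS students):

```python
import numpy as np
import pandas as pd

# Hypothetical final grades with deliberately unequal group sizes.
demo = pd.DataFrame({
    "school": ["GP"] * 8 + ["MS"] * 2,
    "G3": [10, 12, 11, 13, 12, 14, 11, 13, 9, 13],
})

# Group size, mean, and sample standard deviation per school.
stats = demo.groupby("school")["G3"].agg(["count", "mean", "std"])

# Standard error of the mean: std / sqrt(n).
stats["sem"] = stats["std"] / np.sqrt(stats["count"])
print(stats)
```

In this toy example the smaller group has a much larger standard error, so a one-point gap between the means would not be convincing on its own.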