## EXPLORATORY DATA ANALYSIS
by [Iqrar Agalosi Nureyza](https://www.kaggle.com/iqrar99)

Hello everyone! I'm a student and I'm trying my best to do data analysis. This Student Performance Dataset is a very good dataset for sharpening your analysis skills. I hope my analysis is easy to follow.

**Table of Contents**
1. [Basic Analysis](#1)
    * [Frequency](#2)
    * [Male vs Female](#MF)
    * [Top 30 Students](#T)

In [1]:
#importing all important packages
import numpy as np #linear algebra
import pandas as pd #data processing

In [2]:
#input data
data = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
data.head(10)

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
| 5 | female | group B | associate's degree | standard | none | 71 | 83 | 78 |
| 6 | female | group B | some college | standard | completed | 88 | 95 | 92 |
| 7 | male | group B | some college | free/reduced | none | 40 | 43 | 39 |
| 8 | male | group D | high school | free/reduced | completed | 64 | 64 | 67 |
| 9 | female | group B | high school | free/reduced | none | 38 | 60 | 50 |


In [3]:
data.info() #checking data type for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
gender                         1000 non-null object
race/ethnicity                 1000 non-null object
parental level of education    1000 non-null object
lunch                          1000 non-null object
test preparation course        1000 non-null object
math score                     1000 non-null int64
reading score                  1000 non-null int64
writing score                  1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [4]:
data.shape

(1000, 8)

Our data has 8 columns and 1000 rows. Let's do some basic analysis.

<a id ="1"></a>
### Basic Analysis

In [5]:
data.describe() #starter code to find out some basic statistical insights

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


Now, let's check for any missing values.

In [6]:
print(data.isna().sum())

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64


There are no missing values in our data. So, we don't need to drop any values.

<a id = "2"></a>
#### Frequency
First, we look at the students' gender.

In [7]:
data['gender'].value_counts()

female    518
male      482
Name: gender, dtype: int64

There are more female students (518) than male students (482) in this data.

In [8]:
data['race/ethnicity'].value_counts() #selecting by column name instead of position

group C    319
group D    262
group B    190
group E    140
group A     89
Name: race/ethnicity, dtype: int64

As we can see, group C has the most members of any group.

In [9]:
data['parental level of education'].value_counts()

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64

In [10]:
data['lunch'].value_counts()

standard        645
free/reduced    355
Name: lunch, dtype: int64

Most students (645 of 1000) have standard lunch.

In [11]:
data['test preparation course'].value_counts()

none         642
completed    358
Name: test preparation course, dtype: int64

It seems that the number of students who did not complete the course (642) is nearly double the number who completed it (358).
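
To put a number on that comparison, here is a minimal sketch. Since the CSV file is not available outside Kaggle, it rebuilds the column from the two counts printed above; the variable names `prep`, `shares` and `ratio` are my own and not part of the notebook.

```python
import pandas as pd

# Rebuild the column from the printed counts (642 "none", 358 "completed")
prep = pd.Series(["none"] * 642 + ["completed"] * 358,
                 name="test preparation course")

counts = prep.value_counts()                 # absolute frequencies
shares = prep.value_counts(normalize=True)   # relative frequencies (sum to 1)
ratio = counts["none"] / counts["completed"] # how many times larger

print(shares.round(3))
print("none / completed ratio:", round(ratio, 2))
```

On the real data, `data['test preparation course'].value_counts(normalize=True)` should give the same shares directly.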

<a id = "MF"></a>
#### Male vs Female
We will see who is better in math, reading, and writing.

In [12]:
male = data[data['gender'] == 'male']
female = data[data['gender'] != 'male']

print("Math Score")
print("Male    :",round(male['math score'].sum()/len(male),3))
print("Female  :",round(female['math score'].sum()/len(female),3),'\n')

print("Reading Score")
print("Male    :",round(male['reading score'].sum()/len(male),3))
print("Female  :",round(female['reading score'].sum()/len(female),3),'\n')

print("Writing Score")
print("Male    :",round(male['writing score'].sum()/len(male),3))
print("Female  :",round(female['writing score'].sum()/len(female),3))

Math Score
Male    : 68.728
Female  : 63.633 

Reading Score
Male    : 65.473
Female  : 72.608 

Writing Score
Male    : 63.311
Female  : 72.467


On average, male students score higher in math (68.7 vs 63.6), while female students score higher in reading (72.6 vs 65.5) and writing (72.5 vs 63.3).
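
The same per-gender averages can be computed in one step with `groupby`. This is only a sketch: it uses the ten preview rows shown at the top of the notebook as a stand-in for the full file, so its numbers will not match the averages above. The names `sample`, `score_cols` and `means` are assumptions for illustration.

```python
import pandas as pd

# The ten rows from data.head(10), used as a small stand-in for the full dataset
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male",
               "female", "female", "male", "male", "female"],
    "math score":    [72, 69, 90, 47, 76, 71, 88, 40, 64, 38],
    "reading score": [72, 90, 95, 57, 78, 83, 95, 43, 64, 60],
    "writing score": [74, 88, 93, 44, 75, 78, 92, 39, 67, 50],
})

score_cols = ["math score", "reading score", "writing score"]

# One row per gender, one column per subject
means = sample.groupby("gender")[score_cols].mean().round(3)
print(means)
```

With the full dataset, replacing `sample` by `data` should reproduce the numbers printed by the cell above.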

<a id = "T"></a>
#### Top 30 Students
Let's look at the 30 students with the highest total scores.

In [13]:
scores = pd.DataFrame(data['math score'] + data['reading score'] + data['writing score'], columns = ["total score"])
scores = pd.merge(data,scores, left_index = True, right_index = True).sort_values(by=['total score'],ascending=False)
scores.head(30)

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | total score |
|---|---|---|---|---|---|---|---|---|---|
| 916 | male | group E | bachelor's degree | standard | completed | 100 | 100 | 100 | 300 |
| 458 | female | group E | bachelor's degree | standard | none | 100 | 100 | 100 | 300 |
| 962 | female | group E | associate's degree | standard | none | 100 | 100 | 100 | 300 |
| 114 | female | group E | bachelor's degree | standard | completed | 99 | 100 | 100 | 299 |
| 179 | female | group D | some high school | standard | completed | 97 | 100 | 100 | 297 |
| 712 | female | group D | some college | standard | none | 98 | 100 | 99 | 297 |
| 165 | female | group C | bachelor's degree | standard | completed | 96 | 100 | 100 | 296 |
| 625 | male | group D | some college | standard | completed | 100 | 97 | 99 | 296 |
| 903 | female | group D | bachelor's degree | free/reduced | completed | 93 | 100 | 100 | 293 |
| 149 | male | group E | associate's degree | free/reduced | completed | 100 | 100 | 93 | 293 |


It seems that there are 3 geniuses here: they got perfect scores in all three subjects. However, 2 of them didn't complete the test preparation course. There are only 2 possibilities: **Genius** or **Cheating**.
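
The ranking that surfaces these students could also be built without the intermediate merge. Below is a minimal sketch of that alternative, not the notebook's own method. It uses four rows from the preview at the top of the notebook, and the names `sample`, `ranked` and `top` are assumptions.

```python
import pandas as pd

# Four rows copied from the data.head(10) preview
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male"],
    "math score":    [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
    "writing score": [74, 88, 93, 44],
})

score_cols = ["math score", "reading score", "writing score"]

# Add the row-wise sum as a new column, then keep the n highest totals
ranked = sample.assign(**{"total score": sample[score_cols].sum(axis=1)})
top = ranked.nlargest(2, "total score")
print(top)
```

On the full data this would be `nlargest(30, "total score")`. Note that `nlargest` keeps the first rows on ties by default, so the order of equal totals may differ from `sort_values`.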

In this top group, most students have standard lunch rather than free/reduced lunch, which suggests that students with standard lunch tend to score higher.
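
The lunch comparison can be checked more directly than by reading the top of the ranking. Here is a minimal sketch that averages the total score per lunch type. It again uses only the ten preview rows from the top of the notebook, so it illustrates the technique rather than a result for all 1000 students; `sample` and `by_lunch` are assumed names.

```python
import pandas as pd

# Lunch type and scores for the ten rows shown by data.head(10)
sample = pd.DataFrame({
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard",
              "standard", "standard", "free/reduced", "free/reduced", "free/reduced"],
    "math score":    [72, 69, 90, 47, 76, 71, 88, 40, 64, 38],
    "reading score": [72, 90, 95, 57, 78, 83, 95, 43, 64, 60],
    "writing score": [74, 88, 93, 44, 75, 78, 92, 39, 67, 50],
})

# Total score per student, then the mean total for each lunch type
sample["total score"] = sample[["math score", "reading score", "writing score"]].sum(axis=1)
by_lunch = sample.groupby("lunch")["total score"].mean()
print(by_lunch)
```

Running the same `groupby` on `data` would show whether the gap holds across the whole dataset.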

____________________________