## Student Data Analysis

In this activity, you will follow the steps below to analyze a dataset of student test scores from schools in a fictional school district.

1. Collect the data.

2. Prepare the data.

3. Summarize the data. 

4. Drill down into the data. 

5. Make comparisons. 



### Import required libraries and dependencies

In [1]:
import pandas as pd
import os
import numpy as np

## Step 1: Collect the data.

To collect the data that you’ll need, complete the following steps:

**1. Using the Pandas `read_csv` function and the `os.path.join` function, import the data from the `new_student_data.csv` file, and create a DataFrame called `student_df`.**

In [2]:
# Pass the path components separately so os.path.join inserts the separators
student_data = os.path.join('..', 'Resources', 'new_student_data.csv')
student_df = pd.read_csv(student_data)
student_df

,student_id,student_name,grade,school_name,reading_score,math_score,school_type
0,127008367,Sarah Douglas,11th,Chang High School,87.2,64.1,Public
1,33365505,Francisco Osborne,9th,Fisher High School,,,Public
2,44359500,Ryan Haas,12th,Campbell High School,91.6,54.7,Public
3,24791243,Kathryn Mack,11th,Richard High School,68.9,73.3,Charter
4,121467881,Harold Reynolds,12th,Chang High School,68.7,43.4,Public
...,...,...,...,...,...,...,...
13935,32277979,Kelly Myers,10th,Sullivan High School,62.3,37.9,Public
13936,109412748,Kimberly Burke,10th,Montgomery High School,99.5,89.8,Public
13937,16856426,Crystal Merritt,9th,Turner High School,86.3,71.1,Public
13938,88213835,Misty Wiggins,10th,Fisher High School,75.4,76.4,Public
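`os.path.join` only adds value when the path is passed as separate components. A minimal sketch of that pattern follows. The names `sample_csv` and `sample_df` are hypothetical, and the in-memory CSV (two rows copied from the output above, with fewer columns) stands in for the real file so the snippet runs on its own:

```python
import io
import os

import pandas as pd

# Build the path from separate components; os.path.join inserts
# the separator that the current operating system expects.
csv_path = os.path.join("..", "Resources", "new_student_data.csv")

# Hypothetical stand-in for the file contents, so this sketch does not
# depend on the real CSV being present.
sample_csv = io.StringIO(
    "student_id,student_name,grade,reading_score,math_score\n"
    "127008367,Sarah Douglas,11th,87.2,64.1\n"
    "33365505,Francisco Osborne,9th,,\n"
)
sample_df = pd.read_csv(sample_csv)

print(csv_path)
print(sample_df.shape)
```

Empty fields in the CSV, such as the scores in the second row, are read as `NaN`.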


**2. Use the `head` function, the `tail` function, or both to confirm that Pandas imported the data properly.**

In [3]:
student_df.head()

,student_id,student_name,grade,school_name,reading_score,math_score,school_type
0,127008367,Sarah Douglas,11th,Chang High School,87.2,64.1,Public
1,33365505,Francisco Osborne,9th,Fisher High School,,,Public
2,44359500,Ryan Haas,12th,Campbell High School,91.6,54.7,Public
3,24791243,Kathryn Mack,11th,Richard High School,68.9,73.3,Charter
4,121467881,Harold Reynolds,12th,Chang High School,68.7,43.4,Public


## Good work!

You are now ready for the next lesson. Complete it before starting Step 2.

## Step 2: Prepare the data.

In [4]:
# 1. Check for NaN values: count them in each column
student_df.isnull().sum()

student_id          0
student_name        0
grade               0
school_name         0
reading_score    1414
math_score        705
school_type         0
dtype: int64
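Alongside the raw counts, the share of missing values per column can help judge how much data a later `dropna` will cost. The sketch below uses a small hypothetical DataFrame called `scores`, not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with gaps in both score columns.
scores = pd.DataFrame({
    "reading_score": [87.2, np.nan, 91.6, 68.9],
    "math_score": [64.1, np.nan, 54.7, np.nan],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_counts = scores.isna().sum()

# The mean of a boolean column is the fraction of True values.
missing_share = scores.isna().mean()

print(missing_counts)
print(missing_share)
```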

In [12]:
# Remove rows that are missing a reading score or a math score
student_df.dropna(subset=['reading_score', 'math_score'], inplace=True)
student_df.isnull().sum()

student_id       0
student_name     0
grade            0
school_name      0
reading_score    0
math_score       0
school_type      0
dtype: int64
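With `subset`, `dropna` removes a row when any of the listed columns is missing, which is the default `how="any"` behavior. A minimal sketch on hypothetical data (`df` and `cleaned` are illustrative names), written without `inplace=True` so the original frame stays available for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 lacks both scores, row 3 lacks only the math score.
df = pd.DataFrame({
    "student_name": ["A", "B", "C", "D"],
    "reading_score": [87.2, np.nan, 91.6, 68.9],
    "math_score": [64.1, np.nan, 54.7, np.nan],
})

# Drop rows where either score is missing; df itself is left unchanged.
cleaned = df.dropna(subset=["reading_score", "math_score"])

rows_removed = len(df) - len(cleaned)
print(rows_removed)
print(cleaned.index.tolist())
```

The surviving rows keep their original index labels, which is why the notebook output above jumps from label 0 to label 2.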

In [13]:
student_df

,student_id,student_name,grade,school_name,reading_score,math_score,school_type
0,127008367,Sarah Douglas,11th,Chang High School,87.2,64.1,Public
2,44359500,Ryan Haas,12th,Campbell High School,91.6,54.7,Public
3,24791243,Kathryn Mack,11th,Richard High School,68.9,73.3,Charter
4,121467881,Harold Reynolds,12th,Chang High School,68.7,43.4,Public
5,79397676,Kyle Brooks,9th,Turner High School,72.6,55.4,Public
...,...,...,...,...,...,...,...
13935,32277979,Kelly Myers,10th,Sullivan High School,62.3,37.9,Public
13936,109412748,Kimberly Burke,10th,Montgomery High School,99.5,89.8,Public
13937,16856426,Crystal Merritt,9th,Turner High School,86.3,71.1,Public
13938,88213835,Misty Wiggins,10th,Fisher High School,75.4,76.4,Public


In [15]:
#2 Find duplicated rows
student_df.duplicated().sum()

1299

In [18]:
# Remove duplicated rows (the first occurrence of each is kept)
student_df = student_df.drop_duplicates()
student_df

,student_id,student_name,grade,school_name,reading_score,math_score,school_type
0,127008367,Sarah Douglas,11th,Chang High School,87.2,64.1,Public
2,44359500,Ryan Haas,12th,Campbell High School,91.6,54.7,Public
3,24791243,Kathryn Mack,11th,Richard High School,68.9,73.3,Charter
4,121467881,Harold Reynolds,12th,Chang High School,68.7,43.4,Public
5,79397676,Kyle Brooks,9th,Turner High School,72.6,55.4,Public
...,...,...,...,...,...,...,...
13935,32277979,Kelly Myers,10th,Sullivan High School,62.3,37.9,Public
13936,109412748,Kimberly Burke,10th,Montgomery High School,99.5,89.8,Public
13937,16856426,Crystal Merritt,9th,Turner High School,86.3,71.1,Public
13938,88213835,Misty Wiggins,10th,Fisher High School,75.4,76.4,Public


In [19]:
student_df.duplicated().sum()

0
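By default, `duplicated` flags only the second and later copies of a row, so the count above is the number of rows that `drop_duplicates` removes. The sketch below, on a hypothetical four-row frame, also uses `keep=False` to display every copy for inspection before dropping anything:

```python
import pandas as pd

# Hypothetical sample in which the row for student 2 appears twice.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "math_score": [64.1, 54.7, 54.7, 73.3],
})

# Default keep="first": only the repeated copy is flagged.
dup_count = df.duplicated().sum()

# keep=False flags every member of a duplicate group.
all_copies = df[df.duplicated(keep=False)]

# Drop the repeats and keep the first occurrence of each row.
deduped = df.drop_duplicates()

print(dup_count)
print(all_copies)
print(deduped)
```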

In [24]:
# Check datatypes
student_df.dtypes

student_id         int64
student_name      object
grade             object
school_name       object
reading_score    float64
math_score       float64
school_type       object
dtype: object

In [27]:
# In the grade column, remove the "th" suffix (literal match, not a regex)
student_df['grade'] = student_df['grade'].str.replace('th', '', regex=False)
student_df

,student_id,student_name,grade,school_name,reading_score,math_score,school_type
0,127008367,Sarah Douglas,11,Chang High School,87.2,64.1,Public
2,44359500,Ryan Haas,12,Campbell High School,91.6,54.7,Public
3,24791243,Kathryn Mack,11,Richard High School,68.9,73.3,Charter
4,121467881,Harold Reynolds,12,Chang High School,68.7,43.4,Public
5,79397676,Kyle Brooks,9,Turner High School,72.6,55.4,Public
...,...,...,...,...,...,...,...
13935,32277979,Kelly Myers,10,Sullivan High School,62.3,37.9,Public
13936,109412748,Kimberly Burke,10,Montgomery High School,99.5,89.8,Public
13937,16856426,Crystal Merritt,9,Turner High School,86.3,71.1,Public
13938,88213835,Misty Wiggins,10,Fisher High School,75.4,76.4,Public


In [28]:
# Change the grade column to the int type
student_df['grade'] = student_df['grade'].astype(int)
student_df.dtypes

student_id         int64
student_name      object
grade              int32
school_name       object
reading_score    float64
math_score       float64
school_type       object
dtype: object
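A plain `str.replace('th', '')` removes every occurrence of "th" in the string, not just a trailing one. That is harmless for values like `11th`, but an anchored pattern is stricter. The sketch below uses a hypothetical Series named `grades` and requests `int64` explicitly. The `int32` shown above is what `astype(int)` produced in the original environment, and the same call may give `int64` on other platforms.

```python
import pandas as pd

# Hypothetical sample of grade labels in the same format as the dataset.
grades = pd.Series(["11th", "9th", "12th", "10th"])

# Remove "th" only at the end of each string, then cast to a fixed-width
# integer type so the dtype does not depend on the platform.
grade_numbers = grades.str.replace(r"th$", "", regex=True).astype("int64")

print(grade_numbers.tolist())
print(grade_numbers.dtype)
```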

## Step 3: Summarize the data.

In [32]:
#3 Generate the summary statistics
student_df.describe()

,student_id,grade,reading_score,math_score
count,10604.0,10604.0,10604.0,10604.0
mean,69719530.0,10.566013,75.241513,64.343248
std,34708510.0,1.128907,14.283955,16.662284
min,10001320.0,9.0,9.5,1.4
25%,39746260.0,10.0,65.9,52.7
50%,69963680.0,11.0,76.4,65.0
75%,99844400.0,12.0,86.3,76.4
max,129990300.0,12.0,100.0,100.0
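`describe` summarizes every numeric column, including `student_id`, where a mean or standard deviation has no real meaning. One option is to select the score columns first. This is a minimal sketch on hypothetical data, with `score_columns` and `summary` as illustrative names:

```python
import pandas as pd

# Hypothetical three-row sample with an identifier column and two score columns.
df = pd.DataFrame({
    "student_id": [101, 102, 103],
    "reading_score": [70.0, 80.0, 90.0],
    "math_score": [50.0, 60.0, 70.0],
})

# Restrict the summary to the columns where the statistics are meaningful.
score_columns = ["reading_score", "math_score"]
summary = df[score_columns].describe()

print(summary)
```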


In [39]:
# Display the mean of each numeric column
student_df.mean(numeric_only=True)

student_id       6.971953e+07
grade            1.056601e+01
reading_score    7.524151e+01
math_score       6.434325e+01
dtype: float64

In [38]:
# Display the mean math score
student_df['math_score'].mean()

64.34324783100718

In [42]:
# Store the minimum reading score in min_reading_score
min_reading_score = student_df['reading_score'].min()
min_reading_score

9.5

## Step 4: Drill down into the data.

In [50]:
#4 Drill Down into the Data

# Display the grade column using loc
student_df.loc[:, "grade"]

0        11
2        12
3        11
4        12
5         9
         ..
13935    10
13936    10
13937     9
13938    10
13939    11
Name: grade, Length: 10604, dtype: int32
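`loc` selects by label, and it also accepts a boolean mask for the rows together with a list of column names. The sketch below filters a hypothetical sample down to the grade 12 rows and keeps two columns. The frame `df` and the name `seniors` are illustrative:

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset.
df = pd.DataFrame({
    "student_name": ["A", "B", "C", "D"],
    "grade": [11, 12, 11, 12],
    "math_score": [64.1, 54.7, 73.3, 43.4],
})

# Row selector: a boolean mask. Column selector: a list of labels.
seniors = df.loc[df["grade"] == 12, ["student_name", "math_score"]]

print(seniors)
```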

In [58]:
# Display the first 3 rows of the columns at positions 3, 4, and 5 (zero-based) using iloc
student_df.iloc[0:3, 3:6]

,school_name,reading_score,math_score
0,Chang High School,87.2,64.1
2,Campbell High School,91.6,54.7
3,Richard High School,68.9,73.3
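The index labels in this output are 0, 2, 3 because the row labeled 1 was dropped during cleaning. `iloc` counts positions and ignores those labels, while `loc` looks them up. A minimal sketch of the difference, using a hypothetical frame built from the three rows shown above:

```python
import pandas as pd

# Three rows copied from the output above, with the same gap in the index.
df = pd.DataFrame(
    {
        "school_name": ["Chang High School", "Campbell High School", "Richard High School"],
        "reading_score": [87.2, 91.6, 68.9],
        "math_score": [64.1, 54.7, 73.3],
    },
    index=[0, 2, 3],
)

# Position 1 is the second row, which carries the label 2.
second_by_position = df.iloc[1]
same_row_by_label = df.loc[2]

# Label 1 no longer exists, so df.loc[1] would raise a KeyError.
label_1_exists = 1 in df.index

# Positional slices exclude the end position: rows 0-1, columns 1-2.
first_two_scores = df.iloc[0:2, 1:3]

print(second_by_position.tolist())
print(label_1_exists)
print(first_two_scores)
```

Calling `reset_index(drop=True)` after cleaning would make labels and positions line up again, if that is preferred.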
