### Import required dependencies

In [1]:
import numpy as np
import pandas as pd
import os

## Deliverable 1: Collect the Data

To collect the data that you’ll need, complete the following steps:

1. Using the Pandas `read_csv` function and the `os` module, import the data from the `new_full_student_data.csv` file, and create a DataFrame called `student_df`.

2. Use the head function to confirm that Pandas properly imported the data.
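The two steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the notebook's own cell: it writes a tiny CSV with made-up rows to a temporary directory (the names `sample_csv`, `data_dir`, and `sample_df` are hypothetical) so that `os.path.join` and `read_csv` can run without the real file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows standing in for new_full_student_data.csv.
sample_csv = (
    "student_id,student_name,grade,reading_score,math_score\n"
    "1,Student A,9th,59.0,88.2\n"
    "2,Student B,11th,,27.5\n"
)

with tempfile.TemporaryDirectory() as data_dir:
    # os.path.join inserts the path separator that suits the current OS.
    csv_path = os.path.join(data_dir, "new_full_student_data.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write(sample_csv)

    # The DataFrame lives in memory, so it survives the directory cleanup.
    sample_df = pd.read_csv(csv_path)

# head() returns up to the first five rows; this sample has only two.
print(sample_df.head())
```

The empty `reading_score` field in the second row is read as `NaN`, which is the kind of missing value that Deliverable 2 removes.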


In [2]:
# Create the path and import the data
student_data = os.path.join('/Users/Jorge/Desktop/BCanvas/Modulo_ PPandas/', 'new_full_student_data.csv')
student_df = pd.read_csv(student_data)

In [3]:
# Verify that the data was properly imported
student_df.head()

| | student_id | student_name | grade | school_name | reading_score | math_score | school_type | school_budget |
|---|---|---|---|---|---|---|---|---|
| 0 | 103880842 | Travis Martin | 9th | Sullivan High School | 59.0 | 88.2 | Public | 961125 |
| 1 | 45069750 | Michael Brown | 9th | Dixon High School | 94.7 | 73.5 | Charter | 870334 |
| 2 | 45024902 | Gabriela Lucero | 9th | Wagner High School | 89.0 | 70.4 | Public | 846745 |
| 3 | 62582498 | Susan Richardson | 9th | Silva High School | 69.7 | 80.3 | Public | 991918 |
| 4 | 16437227 | Sherry Davis | 11th | Bowers High School | NaN | 27.5 | Public | 848324 |




## Deliverable 2: Prepare the Data

1. In the student DataFrame, check for rows that have NaN (or missing) values, and remove those rows, as the following image shows: ![image-2.png](attachment:image-2.png)
2. In the student DataFrame, check for duplicate rows, and remove them.
3. Check the data types of the columns by using the `dtypes` property, as the following image shows: ![image-3.png](attachment:image-3.png)
4. In the grade column, remove the "th" suffix from every value by using `str` and `replace`, as the following image shows: ![image-4.png](attachment:image-4.png)
5. Change the "grade" column to the int type, and then verify the column types, as the following image shows: ![image.png](attachment:image.png)
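The cleaning steps above can be chained on a small frame. The sketch below uses made-up rows (`raw_df` and `clean_df` are hypothetical names) with one missing score and one exact duplicate; it passes `regex=False` so that "th" is treated as a literal string, which is an optional choice and not something the assignment requires.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing reading score and one exact duplicate.
raw_df = pd.DataFrame(
    {
        "student_name": ["Student A", "Student B", "Student C", "Student C"],
        "grade": ["9th", "11th", "10th", "10th"],
        "reading_score": [59.0, np.nan, 71.1, 71.1],
    }
)

# Drop rows with missing values, then drop repeated rows.
clean_df = raw_df.dropna().drop_duplicates().copy()

# Strip the literal "th" suffix and convert the result to integers.
clean_df["grade"] = (
    clean_df["grade"].str.replace("th", "", regex=False).astype(int)
)

print(clean_df.dtypes)
```

Calling `.copy()` after the two filters gives the later column assignment its own frame to write to, which avoids chained-assignment warnings in some pandas versions.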





In [4]:
# Check for null values and remove those rows
student_df.isna().sum()
student_df=student_df.dropna()
student_df.isna().sum()

student_id       0
student_name     0
grade            0
school_name      0
reading_score    0
math_score       0
school_type      0
school_budget    0
dtype: int64

In [5]:
# Check for duplicate rows
student_df.duplicated().sum()

1836

In [6]:
# Remove duplicated rows
student_df=student_df.drop_duplicates()
student_df.duplicated().sum()

0

In [7]:
# Check data types
student_df.dtypes

student_id         int64
student_name      object
grade             object
school_name       object
reading_score    float64
math_score       float64
school_type       object
school_budget      int64
dtype: object

In [8]:
# In the grade column, remove the "th" suffix from every value by using str and replace
student_df["grade"] = student_df["grade"].str.replace("th", "", regex=False)
student_df["grade"]

0         9
1         9
2         9
3         9
5         9
         ..
19508    10
19509    12
19511    11
19512    11
19513    12
Name: grade, Length: 14831, dtype: object

In [9]:
# Change the "grade" column to the int type, and then verify the column types
student_df["grade"]=student_df["grade"].astype("int")
student_df.dtypes

student_id         int64
student_name      object
grade              int32
school_name       object
reading_score    float64
math_score       float64
school_type       object
school_budget      int64
dtype: object



## Deliverable 3: Summarize the Data

1. Generate the summary statistics for the student DataFrame by using the `describe` function.
2. Display the mean math score by using the `mean` function.
3. Store the minimum reading score in `min_reading_score`.
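As a rough illustration of these three calls, the sketch below runs them on a hypothetical four-row frame (`scores_df` and its values are made up for the example).

```python
import pandas as pd

# Hypothetical scores, used only to show the calls.
scores_df = pd.DataFrame(
    {
        "grade": [9, 10, 11, 12],
        "reading_score": [59.0, 71.1, 94.7, 10.5],
        "math_score": [88.2, 58.4, 73.5, 27.5],
    }
)

# describe() reports count, mean, std, min, quartiles, and max
# for every numeric column.
summary = scores_df.describe()

# Selecting a single column first makes mean() and min() return scalars.
mean_math_score = scores_df["math_score"].mean()
min_reading_score = scores_df["reading_score"].min()

print(summary)
print(mean_math_score, min_reading_score)
```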


In [10]:
# Display summary statistics for the DataFrame
student_df.describe()

| | student_id | grade | reading_score | math_score | school_budget |
|---|---|---|---|---|---|
| count | 14831.0 | 14831.0 | 14831.0 | 14831.0 | 14831.0 |
| mean | 69752960.0 | 10.355539 | 72.357865 | 64.675733 | 893742.749107 |
| std | 34529090.0 | 1.097728 | 15.22459 | 15.844093 | 53938.066467 |
| min | 10009060.0 | 9.0 | 10.5 | 3.7 | 817615.0 |
| 25% | 39844330.0 | 9.0 | 62.2 | 54.5 | 846745.0 |
| 50% | 69659780.0 | 10.0 | 73.8 | 65.3 | 893368.0 |
| 75% | 99274490.0 | 11.0 | 84.0 | 76.0 | 956438.0 |
| max | 129999700.0 | 12.0 | 100.0 | 100.0 | 991918.0 |


In [11]:
# Display the mean math score using the mean function
mean_math=student_df["math_score"].mean()
mean_math

64.67573326141189

In [12]:
# Store the minimum reading score as min_reading_score
min_reading_score=student_df["reading_score"].min()
min_reading_score

10.5



## Deliverable 4: Drill Down into the Data



1. Display the grade column by using `loc`.
2. Display the first three rows of Columns 3, 4, and 5 by using `iloc`.
3. Select the rows for Grade 9, and display their summary statistics by using `loc` and `describe`.
4. Store the row with the min overall reading score in `min_reading_row` by using `loc` and the `min_reading_score` variable from Deliverable 3.
5. Select all the reading scores from the 10th graders at Dixon High School by using `loc` with conditionals.
6. Find the mean reading score for all the students in Grades 11 and 12 combined by using conditional statements and `loc` or `iloc`.
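The selections above can be sketched on a hypothetical five-row frame. The `students` frame and its values are made up; the column order mirrors the real data so that positions 3, 4, and 5 line up. The last step uses `isin([11, 12])`, which is one possible way to write the condition alongside a comparison such as `>= 11`.

```python
import pandas as pd

# Hypothetical rows; only the column layout follows the real data.
students = pd.DataFrame(
    {
        "student_id": [1, 2, 3, 4, 5],
        "student_name": ["A", "B", "C", "D", "E"],
        "grade": [9, 10, 10, 11, 12],
        "school_name": [
            "Dixon High School",
            "Dixon High School",
            "Bowers High School",
            "Dixon High School",
            "Bowers High School",
        ],
        "reading_score": [59.0, 10.5, 71.1, 80.0, 90.0],
        "math_score": [88.2, 58.4, 70.4, 73.5, 27.5],
    }
)

# Label-based: every row, one named column.
grade_column = students.loc[:, "grade"]

# Position-based: rows 0-2 and the columns at positions 3, 4, and 5.
first_rows = students.iloc[0:3, 3:6]

# Boolean mask: the row(s) whose reading score equals the minimum.
min_reading_score = students["reading_score"].min()
min_reading_row = students.loc[students["reading_score"] == min_reading_score]

# Two conditions combined with &; each one needs its own parentheses.
dixon_10th = students.loc[
    (students["school_name"] == "Dixon High School") & (students["grade"] == 10),
    "reading_score",
]

# Grades 11 and 12 combined; a column label (not a list) yields a scalar mean.
upper_grade_mean = students.loc[
    students["grade"].isin([11, 12]), "reading_score"
].mean()

print(first_rows)
print(upper_grade_mean)
```

Passing a list such as `["reading_score"]` as the column selector returns a DataFrame, so `mean()` then produces a one-element Series instead of a single number.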




In [13]:
# Use loc to display the grade column
student_df.loc[:,"grade"]

0         9
1         9
2         9
3         9
5         9
         ..
19508    10
19509    12
19511    11
19512    11
19513    12
Name: grade, Length: 14831, dtype: int32

In [14]:
# Use `iloc` to display the first 3 rows and columns 3, 4, and 5.
student_df.iloc[0:3,3:6]

| | school_name | reading_score | math_score |
|---|---|---|---|
| 0 | Sullivan High School | 59.0 | 88.2 |
| 1 | Dixon High School | 94.7 | 73.5 |
| 2 | Wagner High School | 89.0 | 70.4 |


In [15]:
# Select the rows for grade nine and display their summary statistics using `loc` and `describe`.
grade9_df=student_df.loc[student_df["grade"]==9]
grade9_df.describe()

| | student_id | grade | reading_score | math_score | school_budget |
|---|---|---|---|---|---|
| count | 4132.0 | 4132.0 | 4132.0 | 4132.0 | 4132.0 |
| mean | 69794410.0 | 9.0 | 69.236713 | 66.585624 | 898692.606002 |
| std | 34705650.0 | 0.0 | 15.277354 | 16.661533 | 54891.596611 |
| min | 10009060.0 | 9.0 | 17.9 | 5.3 | 817615.0 |
| 25% | 39538480.0 | 9.0 | 59.0 | 56.0 | 846745.0 |
| 50% | 69840370.0 | 9.0 | 70.05 | 67.8 | 893368.0 |
| 75% | 99395040.0 | 9.0 | 80.5 | 78.5 | 957299.0 |
| max | 129999700.0 | 9.0 | 99.9 | 100.0 | 991918.0 |


In [16]:
# Store the row with the minimum overall reading score as `min_reading_row`
# using `loc` and the `min_reading_score` found in Deliverable 3.
min_reading_row = student_df.loc[student_df["reading_score"] == min_reading_score]
min_reading_row

| | student_id | student_name | grade | school_name | reading_score | math_score | school_type | school_budget |
|---|---|---|---|---|---|---|---|---|
| 3706 | 81758630 | Matthew Thomas | 10 | Dixon High School | 10.5 | 58.4 | Charter | 870334 |


In [17]:
# Use loc with conditionals to select all reading scores from 10th graders at Dixon High School.
student_df.loc[(student_df["school_name"]=="Dixon High School")& (student_df["grade"]==10),["school_name","reading_score"]]

| | school_name | reading_score |
|---|---|---|
| 45 | Dixon High School | 71.1 |
| 60 | Dixon High School | 59.5 |
| 69 | Dixon High School | 88.6 |
| 94 | Dixon High School | 81.5 |
| 100 | Dixon High School | 95.3 |
| ... | ... | ... |
| 19283 | Dixon High School | 52.9 |
| 19306 | Dixon High School | 58.0 |
| 19344 | Dixon High School | 38.0 |
| 19368 | Dixon High School | 84.4 |


In [27]:
# Find the mean reading score for all students in grades 11 and 12 combined.
average_reading= student_df.loc[(student_df["grade"]>=11),["reading_score"]].mean()
average_reading

reading_score    74.900381
dtype: float64


## Deliverable 5: Make Comparisons 

1. Display the average budget for each school type by using the groupby and mean functions.
2. Find the total number of students at each school, and sort those numbers from largest to smallest by using the groupby, count, and sort_values functions.
3. Find the average math score by grade for each school type by using the groupby and mean functions.
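The three comparisons above can be sketched on a hypothetical six-row frame. The `schools` frame is made up for illustration, reusing three school names and budgets from the sample rows shown earlier. Selecting the target column before aggregating is one way to keep text columns such as names out of `mean`.

```python
import pandas as pd

# Hypothetical student rows; each row carries its school's budget.
schools = pd.DataFrame(
    {
        "student_id": [1, 2, 3, 4, 5, 6],
        "school_name": [
            "Dixon High School",
            "Dixon High School",
            "Dixon High School",
            "Bowers High School",
            "Bowers High School",
            "Silva High School",
        ],
        "school_type": ["Charter", "Charter", "Charter", "Public", "Public", "Public"],
        "grade": [9, 9, 10, 9, 10, 10],
        "math_score": [70.0, 80.0, 60.0, 50.0, 65.0, 55.0],
        "school_budget": [870334, 870334, 870334, 848324, 848324, 991918],
    }
)

# Average budget per school type, computed over student rows.
avg_budget = schools.groupby("school_type")["school_budget"].mean()

# Students per school, sorted from largest to smallest.
student_count = (
    schools.groupby("school_name")["student_id"]
    .count()
    .sort_values(ascending=False)
)

# Average math score for every (school_type, grade) pair.
math_by_type_grade = schools.groupby(["school_type", "grade"])["math_score"].mean()

print(avg_budget)
print(student_count)
print(math_by_type_grade)
```

Because every row is a student, the budget average is weighted by enrollment: a school with more students contributes more rows, so it pulls the mean toward its own budget.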

In [32]:
# Use groupby and mean to find the average budget for each school type.
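# numeric_only=True skips text columns such as student_name, which newer pandas versions refuse to average.
abudget_df = student_df.groupby("school_type").mean(numeric_only=True)
abudget_df.loc[:, ["school_budget"]]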

| school_type | school_budget |
|---|---|
| Charter | 872625.656236 |
| Public | 911195.558251 |


In [36]:
# Use the `groupby`, `count`, and `sort_values` functions to find the total number of students at each school and sort from most students to least students.
student_count=student_df.groupby("school_name").count()
student_count=student_count.rename(columns = {'student_id':'student_count'})
student_count.loc[:,['student_count']].sort_values('student_count', ascending=False)

| school_name | student_count |
|---|---|
| Montgomery High School | 2038 |
| Green High School | 1961 |
| Dixon High School | 1583 |
| Wagner High School | 1541 |
| Silva High School | 1109 |
| Woods High School | 1052 |
| Sullivan High School | 971 |
| Turner High School | 846 |
| Bowers High School | 803 |
| Fisher High School | 798 |


In [41]:
# Use groupby and mean to find the average math score by grade for each school type.
grade_mean_df=student_df[["math_score", "school_type", "grade"]].groupby(["school_type", "grade"]).mean()
grade_mean_df

| school_type | grade | math_score |
|---|---|---|
| Charter | 9 | 70.077874 |
| Charter | 10 | 66.443206 |
| Charter | 11 | 68.024735 |
| Charter | 12 | 60.212121 |
| Public | 9 | 63.771066 |
| Public | 10 | 63.764121 |
| Public | 11 | 59.314337 |
| Public | 12 | 63.568319 |



## Deliverable 6: Summarize Your Findings
In the cell below, write a few sentences to describe any discoveries you made while performing your analysis along with any additional analysis you believe would be worthwhile.

Public schools have a larger average budget than charter schools (about 911,196 versus 872,626), but they also have more students. Average math scores are higher in charter schools for Grades 9 through 11, while public schools score higher in Grade 12 (63.6 versus 60.2). Further analysis could compare the number of students per grade against math and reading scores, school name against budget, and grade against budget.