In [25]:
# I have included the rubric provided on bootcampspot 
# (at https://courses.bootcampspot.com/courses/2319/assignments/38247?module_item_id=741063) 
# for reference. Items from the online rubric are commented out with 
# <this notation>.

# There are instances where the rubric differs from the starter code, particularly 
# where one calls for a value to be "displayed" while the other calls for the 
# value to be "stored". I wrote my original code to satisfy the online rubric, and 
# have commented code out to conform with the letter of the Deliverables in this 
# submission. Other items are worth zero points on the rubric, and have no effect on 
# the function of the code, so I have left those out. 

### Import required dependencies

In [26]:
import pandas as pd
import os

## Deliverable 1: Collect the Data

To collect the data that you’ll need, complete the following steps:

1. Using the Pandas `read_csv` function and the `os` module, import the data from the `new_full_student_data.csv` file, and create a DataFrame called student_df. 

2. Use the head function to confirm that Pandas properly imported the data.
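
As a rough sketch of these two steps, the snippet below builds a path with `os.path.join`, reads it with `read_csv`, and previews it with `head`. So that it runs on its own, it writes a tiny stand-in CSV to a temporary folder first; the folder, the file name `sample_student_data.csv`, and the rows are all made up for illustration and are not the real `new_full_student_data.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the Resources folder: a temporary directory
# holding a two-row CSV with a few of the columns used in this notebook.
with tempfile.TemporaryDirectory() as tmp_dir:
    resources_dir = os.path.join(tmp_dir, "Resources")
    os.makedirs(resources_dir)

    # os.path.join inserts the correct path separator for the current OS.
    csv_path = os.path.join(resources_dir, "sample_student_data.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("student_id,grade,math_score\n")
        handle.write("1,9th,88.2\n")
        handle.write("2,11th,27.5\n")

    # Read the file while the temporary folder still exists.
    sample_df = pd.read_csv(csv_path)

# head() returns up to the first five rows (here, both rows).
print(sample_df.head())
```

Building the path from its parts, instead of typing a string with slashes, keeps the same code working on Windows, macOS, and Linux.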


In [27]:
# Create the path and import the data
# <The path to the file is built by using os.path.join. (2 points)>

student_data = os.path.join('..','School_District_Analysis','Resources', 'new_full_student_data.csv')

student_df = pd.read_csv(student_data)

In [28]:
# Verify that the data was properly imported

# <The DataFrame is created and named student_df. (2 points)>

student_df = pd.read_csv(student_data)
# <The first five rows of data are displayed. (1 points)>

student_df.head()

,student_id,student_name,grade,school_name,reading_score,math_score,school_type,school_budget
0,103880842,Travis Martin,9th,Sullivan High School,59.0,88.2,Public,961125
1,45069750,Michael Brown,9th,Dixon High School,94.7,73.5,Charter,870334
2,45024902,Gabriela Lucero,9th,Wagner High School,89.0,70.4,Public,846745
3,62582498,Susan Richardson,9th,Silva High School,69.7,80.3,Public,991918
4,16437227,Sherry Davis,11th,Bowers High School,,27.5,Public,848324


## Deliverable 2: Prepare the Data

To prepare and clean your data for analysis, complete the following steps:
    
1. Check for and remove all rows with `NaN`, or missing, values in the student DataFrame. 

2. Check for and remove all duplicate rows in the student DataFrame.

3. Check data types using the `dtypes` property.

4. Remove the "th" suffix from every value in the grade column using the `str.replace` function.

5. Change the grade column to the `int` type and verify column types.

6. Use the head (and/or the tail) function to preview the DataFrame.
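
The cleaning steps above can be sketched on a toy DataFrame as follows. The rows are made up for illustration (one missing score and one exact duplicate), and `raw_df` and `clean_df` are hypothetical names, not ones used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing reading score and one exact duplicate.
raw_df = pd.DataFrame(
    {
        "student_id": [1, 2, 2, 3],
        "grade": ["9th", "10th", "10th", "12th"],
        "reading_score": [59.0, 94.7, 94.7, np.nan],
    }
)

# Drop the row with the NaN score, then the repeated row for student 2.
clean_df = raw_df.dropna()
clean_df = clean_df.drop_duplicates().copy()

# regex=False treats "th" as a literal substring rather than a pattern.
clean_df["grade"] = clean_df["grade"].str.replace("th", "", regex=False)

# The column still holds strings at this point; convert it to integers.
clean_df["grade"] = clean_df["grade"].astype("int64")

print(clean_df.isna().sum().sum())  # total missing values remaining
print(clean_df.duplicated().sum())  # duplicate rows remaining
print(clean_df.dtypes)
```

Checking `dtypes` before and after the conversion is what confirms that removing the suffix alone does not change the column type.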

In [29]:
# Drop rows with null values and verify removal
# <After the removal of null values, isna().sum() displays 0 for all the columns. (5 points)>

student_df = student_df.dropna()
student_df.isna().sum()

student_id       0
student_name     0
grade            0
school_name      0
reading_score    0
math_score       0
school_type      0
school_budget    0
dtype: int64

In [30]:
# Drop duplicated rows and verify removal
# <After the removal of duplicates, duplicated().sum() displays no duplicates. (5 points)>

student_df = student_df.drop_duplicates()
student_df.duplicated().sum()

0

In [31]:
# Check data types
# <The column types are displayed. (5 points)>
student_df.dtypes

student_id         int64
student_name      object
grade             object
school_name       object
reading_score    float64
math_score       float64
school_type       object
school_budget      int64
dtype: object

In [32]:
# Remove the non-numeric characters and verify the contents of the column

# <The "th" suffix is removed from all the values in the "grade" column. (5 points)>

# regex=False treats "th" as a literal substring rather than a pattern
student_df["grade"] = student_df["grade"].str.replace("th", "", regex=False)

student_df.head()


,student_id,student_name,grade,school_name,reading_score,math_score,school_type,school_budget
0,103880842,Travis Martin,9,Sullivan High School,59.0,88.2,Public,961125
1,45069750,Michael Brown,9,Dixon High School,94.7,73.5,Charter,870334
2,45024902,Gabriela Lucero,9,Wagner High School,89.0,70.4,Public,846745
3,62582498,Susan Richardson,9,Silva High School,69.7,80.3,Public,991918
5,74579444,Cynthia Johnson,9,Montgomery High School,63.5,76.9,Charter,893368


In [33]:
# Change the grade column to the int type and verify column types
# <The "grade" column is successfully converted to an int type. (5 points)>

student_df['grade'] = student_df["grade"].astype("int64")

student_df.dtypes

student_id         int64
student_name      object
grade              int64
school_name       object
reading_score    float64
math_score       float64
school_type       object
school_budget      int64
dtype: object

## Deliverable 3: Summarize the Data

Describe the data using summary statistics on the data as a whole and on individual columns.

1. Generate the summary statistics for the DataFrame by using the `describe` function.

2. Display the mean math score using the `mean` function.

3. Store the minimum reading score as `min_reading_score`.
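
A minimal sketch of the three summary steps, assuming a small DataFrame with only the two score columns. The four rows reuse the first scores shown in the preview above; `scores_df`, `summary`, and `mean_math_score` are hypothetical names for illustration.

```python
import pandas as pd

# Four sample rows with the two score columns only.
scores_df = pd.DataFrame(
    {
        "reading_score": [59.0, 94.7, 89.0, 69.7],
        "math_score": [88.2, 73.5, 70.4, 80.3],
    }
)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = scores_df.describe()

# Column-level statistics are plain scalars that can be displayed or stored.
mean_math_score = scores_df["math_score"].mean()
min_reading_score = scores_df["reading_score"].min()

print(summary)
print(mean_math_score)
print(min_reading_score)
```

Storing a value (as with `min_reading_score`) produces no cell output by itself; the value only appears if the name is evaluated or printed afterwards.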

In [34]:
# Display summary statistics for the DataFrame
# <The summary statistics for the DataFrame are displayed. (6 points)>

student_df.describe()

,student_id,grade,reading_score,math_score,school_budget
count,14831.0,14831.0,14831.0,14831.0,14831.0
mean,69752960.0,10.355539,72.357865,64.675733,893742.749107
std,34529090.0,1.097728,15.22459,15.844093,53938.066467
min,10009060.0,9.0,10.5,3.7,817615.0
25%,39844330.0,9.0,62.2,54.5,846745.0
50%,69659780.0,10.0,73.8,65.3,893368.0
75%,99274490.0,11.0,84.0,76.0,956438.0
max,129999700.0,12.0,100.0,100.0,991918.0


In [35]:
# Display the mean math score using the mean function
# <The mean of the "math_score" column is displayed. (7 points)>

student_df["math_score"].mean()

64.6757332614119

In [36]:
# Store the minimum reading score as min_reading_score
# <The minimum of the "reading_score" column is STORED in min_reading_score. (7 points)>

min_reading_score = student_df["reading_score"].min()
#min_reading_score

## Deliverable 4: Drill Down into the Data

Drill down to specific rows, columns, and subsets of the data.

To drill down into the data, complete the following steps:

1. Use `loc` to display the grade column.

2. Use `iloc` to display the first 3 rows and columns 3, 4, and 5.

3. Show the rows for grade nine using `loc`.

4. Store the row with the minimum overall reading score as `min_reading_row` using `loc` and the `min_reading_score` found in Deliverable 3.

5. Find the reading scores for the school and grade from the output of step three using `loc` with multiple conditional statements.

6. Using conditional statements and `loc` or `iloc`, find the mean reading score for all students in grades 11 and 12 combined.
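
The selection patterns in the steps above can be sketched on a toy DataFrame. All rows and the names `demo_df`, `grade_column`, `first_rows`, and so on are made up for illustration; the `isin` call in the last step is one possible alternative to chaining two conditions with `|`, not the approach the steps prescribe.

```python
import pandas as pd

# Made-up rows for illustration only.
demo_df = pd.DataFrame(
    {
        "student_name": ["Student A", "Student B", "Student C", "Student D", "Student E"],
        "grade": [9, 10, 11, 12, 10],
        "school_name": [
            "Dixon High School",
            "Dixon High School",
            "Silva High School",
            "Dixon High School",
            "Silva High School",
        ],
        "reading_score": [59.0, 94.7, 89.0, 69.7, 63.5],
    }
)

# loc selects by label: every row, one named column.
grade_column = demo_df.loc[:, "grade"]

# iloc selects by position: rows 0-2 and the columns at positions 1, 2, 3.
first_rows = demo_df.iloc[0:3, [1, 2, 3]]

# A boolean mask inside loc keeps only the matching rows.
grade_nine = demo_df.loc[demo_df["grade"] == 9]

# Find the row(s) whose reading score equals the stored minimum.
lowest_score = demo_df["reading_score"].min()
min_row = demo_df.loc[demo_df["reading_score"] == lowest_score]

# Combine conditions with & (each wrapped in parentheses) and pick a column.
is_dixon_tenth = (demo_df["school_name"] == "Dixon High School") & (demo_df["grade"] == 10)
dixon_tenth_scores = demo_df.loc[is_dixon_tenth, "reading_score"]

# isin matches several grades at once; mean() averages the selected scores.
upper_grade_mean = demo_df.loc[demo_df["grade"].isin([11, 12]), "reading_score"].mean()

print(dixon_tenth_scores)
print(upper_grade_mean)
```

Passing the row condition and the column name to a single `loc` call avoids chaining two indexing operations back to back.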

In [37]:
# Use loc to display the grade column
# <The "grade" column is displayed. (4 points)>

# Select every row of the "grade" column by label
student_df.loc[:, "grade"]

0         9
1         9
2         9
3         9
5         9
         ..
19508    10
19509    12
19511    11
19512    11
19513    12
Name: grade, Length: 14831, dtype: int64

In [38]:
# Use `iloc` to display the first 3 rows and columns 3, 4, and 5.

# <The first three rows of Columns 3, 4, and 5 are displayed. (4 points)>

student_df.iloc[0 : 3, [3, 4, 5]]

,school_name,reading_score,math_score
0,Sullivan High School,59.0,88.2
1,Dixon High School,94.7,73.5
2,Wagner High School,89.0,70.4


In [39]:
# Select the rows for grade nine and display their summary statistics using `loc` and `describe`.
# <The summary statistics for 9th graders are displayed. (4 points)>

student_nine_df = student_df.loc[(student_df["grade"] == 9)]
student_nine_df.describe()


,student_id,grade,reading_score,math_score,school_budget
count,4132.0,4132.0,4132.0,4132.0,4132.0
mean,69794410.0,9.0,69.236713,66.585624,898692.606002
std,34705650.0,0.0,15.277354,16.661533,54891.596611
min,10009060.0,9.0,17.9,5.3,817615.0
25%,39538480.0,9.0,59.0,56.0,846745.0
50%,69840370.0,9.0,70.05,67.8,893368.0
75%,99395040.0,9.0,80.5,78.5,957299.0
max,129999700.0,9.0,99.9,100.0,991918.0


In [40]:
# Store the row with the minimum overall reading score as `min_reading_row`
# using `loc` and the `min_reading_score` found in Deliverable 3.

# <*The row that contains the minimum reading score is displayed. (4 points)>

# Don't take four points off for not displaying this line, you ask 
# me to store, not display.

min_reading_row = student_df.loc[student_df["reading_score"] == min_reading_score]
#min_reading_row

In [41]:
# Use loc with conditionals to select all reading scores from 10th graders at Dixon High School.

# <The reading scores of the 10th graders at Dixon High School is displayed. (4 points)>

student_dixon_df = student_df.loc[(student_df["school_name"] == "Dixon High School") & (student_df["grade"] == 10)]
student_dixon_df.loc[:,["school_name", "reading_score"]]

,school_name,reading_score
45,Dixon High School,71.1
60,Dixon High School,59.5
69,Dixon High School,88.6
94,Dixon High School,81.5
100,Dixon High School,95.3
...,...,...
19283,Dixon High School,52.9
19306,Dixon High School,58.0
19344,Dixon High School,38.0
19368,Dixon High School,84.4


In [42]:
# Find the mean reading score for all students in grades 11 and 12 combined.

# <The average reading score of all the students in Grades 11 and 12 combined is calculated. (5 points)>

student_11_12_df = student_df.loc[(student_df["grade"] == 11) | (student_df["grade"] == 12)]

student_11_12_df["reading_score"].mean()

74.90038089192191

## Deliverable 5: Make Comparisons Between District and Charter Schools

Compare district vs charter schools for budget, size, and scores.

Make comparisons within your data by completing the following steps:

1. Using the `groupby` and `mean` functions, look at the average reading and math scores per school type.

2. Using the `groupby` and `count` functions, find the total number of students at each school.

3. Using the `groupby` and `mean` functions, find the average budget per grade for each school type.
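
The three comparisons can be sketched on a toy DataFrame as shown below. The schools, scores, and budgets are made up for illustration, and `demo_df`, `avg_scores`, `students_per_school`, and `avg_budget` are hypothetical names. The sketch selects the numeric columns before calling `mean`, which is one way to keep text columns such as names out of the average.

```python
import pandas as pd

# Made-up schools, scores, and budgets for illustration only.
demo_df = pd.DataFrame(
    {
        "student_id": [1, 2, 3, 4, 5, 6],
        "school_name": ["North", "North", "North", "South", "South", "East"],
        "school_type": ["Charter", "Charter", "Charter", "Public", "Public", "Public"],
        "grade": [9, 9, 10, 9, 10, 10],
        "math_score": [70.0, 80.0, 90.0, 60.0, 70.0, 80.0],
        "reading_score": [75.0, 85.0, 65.0, 70.0, 90.0, 80.0],
        "school_budget": [100, 100, 100, 200, 200, 300],
    }
)

# Average scores per school type, restricted to the two score columns.
avg_scores = demo_df.groupby("school_type")[["math_score", "reading_score"]].mean()

# Students per school, sorted from the largest school to the smallest.
students_per_school = (
    demo_df.groupby("school_name")["student_id"].count().sort_values(ascending=False)
)

# Grouping by two columns yields one average per (school type, grade) pair.
avg_budget = demo_df.groupby(["school_type", "grade"])["school_budget"].mean()

print(avg_scores)
print(students_per_school)
print(avg_budget)
```

Grouping by a list of columns produces a result with a two-level index, which is why the budget output is labeled by both school type and grade.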

In [43]:
# Use groupby and mean to find the average reading and math scores for each school type.


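# Select the score columns before averaging; pandas 2.x raises a TypeError
# when mean() is applied to the text columns (names, school names).
student_school_df = student_df.groupby(by="school_type")[["math_score", "reading_score"]].mean()
student_school_df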

school_type,math_score,reading_score
Charter,66.761883,72.450603
Public,62.951576,72.281219


In [44]:
# Use the `groupby`, `count`, and `sort_values` functions to find the
# total number of students at each school and sort from most students to least students.

# <The total number of students per school is displayed in descending order. (6 points)>

students_per_school_df = student_df.groupby(["school_name"]).count()

students_per_school_df.iloc[:,0:1].sort_values("student_id", ascending = False)

school_name,student_id
Montgomery High School,2038
Green High School,1961
Dixon High School,1583
Wagner High School,1541
Silva High School,1109
Woods High School,1052
Sullivan High School,971
Turner High School,846
Bowers High School,803
Fisher High School,798


In [51]:
# Using the `groupby` and `mean` functions, find the average budget per grade for each school type
# <The average budget for each school type is displayed. (7 points)>

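# Select the budget column before averaging; pandas 2.x raises a TypeError
# when mean() is applied to the text columns (names, school names).
student_school_df = student_df.groupby(['school_type', 'grade'])[["school_budget"]].mean()
student_school_df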

school_type,grade,school_budget
Charter,9,863817.29013
Charter,10,871823.608811
Charter,11,874262.713649
Charter,12,885096.335017
Public,9,926800.159528
Public,10,914715.360382
Public,11,900248.905136
Public,12,895952.915971


## Deliverable 6: Summarize Your Findings
In the cell below, write a few sentences to describe any discoveries you made while performing your analysis along with any additional analysis you believe would be worthwhile.

Using the provided testing data for PyCity Schools, we were asked 
to perform various types of analysis using Pandas on a .csv file.

Output was to be filtered and formatted using a variety of methods 
such as loc and iloc, and various conditional statements.

This was an interesting project, in that there were multiple ways 
to do many of the required operations. I found the practice of doing 
the same operation with both loc and iloc very helpful in understanding 
how they differ. Additionally, the speed with which Pandas is able to 
slice lots of data is impressive.

This exposure to Python and Pandas has opened a potential door for me. 
Building on it, I can see how data gathered in the real world by 
inexpensive sensors could be used to glean insights into vehicle 
dynamics. Matplotlib and other libraries are sure to come in handy.



------------Additional analysis I believe is worthwhile------------

It was educational to find that writing my assignment to 
the structure of the online rubric in my practice workbook, rather 
than in your provided starter code, resulted in a grade of zero. 

I thought I was saving time by working the challenge in parallel with the 
modules, applying the lessons directly to the weekly challenge as I worked 
through the week. When I finished the modules the Weekly Challenge was 
almost done! I just had to polish it up, make sure it conformed with the 
rubric on the Boot Camp website 
(https://courses.bootcampspot.com/courses/2319/assignments/38247?module_item_id=741063) 
and I was good to go! Nowhere on that page does it say you MUST use the 
provided Challenge starter code.

Even though my initial submission was both on time and contained every 
bit of code requested - organized in the exact order in which it was 
requested - I got a zero. It was as if the grader ran a comparison 
function without even opening the file.

"Please submit the correct Module 4 challenge file for grading this 
assignment". 

You may as well have told me "PC Load Letter". WTF does that mean?

This is, of course, some <----REDACTED---->, and no one would tell me 
what I'd done wrong, which is further <----REDACTED---->. I asked in 
the comments, I asked my dude in charge of the class. I would have 
asked my instructor but he got fired and replaced this week, sooo....

I am submitting this in the hopes that our TA is guessing right and it 
was just the fact that you guys are <----------REDACTED---------->. She 
assures me that "Central Grading" are real people, but I'm not convinced.

01110011 01101111 00101110 00101110 00101110 01101001 01100110 
01111001 01101111 01110101 01100001 01110010 01100101 01100001 
01110010 01101111 01100010 01101111 01110100 01100111 01101111 
01100110 01101111 01110010 01101101 01100001 01110100 01111001 
01101111 01110101 01110010 01110011 01100101 01101100 01100110

