# Introduction to Data Science - Project - School Performance

Student Name:

## Problem Statement

The mayor of New York, a former data scientist, is particularly interested in the math scores in the New York school districts.  The mayor has asked you to analyze the test scores across the New York school districts.  The mayor is looking for insights or trends to improve the quality of the New York schools.  She has provided you with a set of scores from across the city.

## Data Sources

The New York math test results are at the following location:

https://catalog.data.gov/dataset/2006-2012-math-test-results-district-all-students

The first few columns in the file are: District, Grade, Year, Demographic, Number Tested, and Mean Scale Score.

## Acquiring Data

### <span style="color:red">Task 1</span>

Download a local copy of the school data.

### <span style="color:red">Task 2</span>

Load the data into a dataframe called "schools" and display the contents of the resulting dataframe.
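
As a minimal sketch of how a CSV is read into a dataframe, the example below parses a few made-up rows with `pd.read_csv`. The rows and the name `schools_demo` are illustrative assumptions that only mimic the column layout listed above; they are not taken from the real file.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout described above;
# they are NOT taken from the real file.
csv_text = """District,Grade,Year,Demographic,Number Tested,Mean Scale Score
1,3,2006,All Students,100,650
1,4,2006,All Students,110,655
2,3,2006,All Students,200,670
2,All Grades,2006,All Students,410,668
"""

# With a downloaded file you would pass its path instead of a StringIO object.
schools_demo = pd.read_csv(io.StringIO(csv_text))
print(schools_demo)
```

In a notebook, ending a cell with the dataframe name displays it as a formatted table.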

## Exploring the Data

### <span style="color:red">Task 3</span>

Display the values and associated frequencies of the grade column. (Hint: see value_counts)

### <span style="color:red">Task 4</span>

Display the values and associated frequencies of the district column, sorting the result by district number.
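
A small sketch of the difference between the default ordering of `value_counts` and ordering by the values themselves. The dataframe `demo` is a made-up example, not the school data.

```python
import pandas as pd

# Hypothetical toy column, used only to show the ordering behaviour.
demo = pd.DataFrame({"District": [2, 1, 2, 3, 2, 1]})

counts = demo["District"].value_counts()  # ordered by frequency, most common first
by_label = counts.sort_index()            # reordered by the District value itself
print(by_label)
```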

### <span style="color:red">Task 5</span>

Display the values and associated frequencies of the year column, sorting the result by year.

## Preparing Data

### <span style="color:red">Task 6</span>

Display all the rows that do not have an explicit grade in the Grade column. (Hint: Look at the data to see what other value appears in that column besides a specific grade, and use that value for selection.)

### <span style="color:red">Task 7</span>

Remove all the rows from "schools" (in place) that do not have an explicit grade in the Grade column.
(Hint: One option is to use drop with the index and inplace arguments. Another is to create a new dataframe named schools by selecting the rows you want to keep.)
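
The two options in the hint can be sketched on a toy dataframe as follows. The data and the names `demo`, `dropped`, and `kept` are assumptions for illustration; the 'All Grades' label is the one the next task refers to.

```python
import pandas as pd

# Hypothetical toy data; not the real school file.
demo = pd.DataFrame({
    "Grade": ["3", "4", "All Grades", "3", "All Grades"],
    "Mean Scale Score": [650, 655, 668, 670, 672],
})

# Boolean mask that is True for the rows without an explicit grade.
mask = demo["Grade"] == "All Grades"

# Option 1: drop the matching index labels in place (on a copy here,
# so that the original toy data stays available for option 2).
dropped = demo.copy()
dropped.drop(index=dropped[mask].index, inplace=True)

# Option 2: build a new dataframe from the rows that do not match.
kept = demo[~mask]

print(kept)
```

Both approaches leave the original index labels of the surviving rows unchanged.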

### <span style="color:red">Task 8</span>

Confirm that the 'All Grades' rows have been removed by doing the value counts on the Grade field again.


## Exploring the Data

### <span style="color:red">Task 9</span>

How many students total were tested in each grade level? (Hint: See groupby, select column you need, and then apply the aggregation function. See the panda_dataframe_project_helper notebook)

### <span style="color:red">Task 10</span>

How many students total were tested in each district?  Sort the result in descending order.  (Use the *sort_values* method after performing the sum).
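
The groupby, select, aggregate pattern from the hints, followed by a descending sort, might look like this on made-up numbers (the `demo` dataframe is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical toy data; not the real school file.
demo = pd.DataFrame({
    "District": [1, 1, 2, 2, 3],
    "Number Tested": [100, 110, 200, 190, 50],
})

# Group the rows, pick the column of interest, then aggregate.
totals = demo.groupby("District")["Number Tested"].sum()

# Largest totals first.
ranked = totals.sort_values(ascending=False)
print(ranked)
```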

### <span style="color:red">Task 11</span>

Use the *describe* method to show a set of descriptive statistics about the 'Mean Scale Score' across all districts for each grade. (Hint: Remember to groupby and select the column you want before you invoke describe)
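
As a rough illustration of what `describe` returns after a groupby, here is a sketch on made-up scores. The result has one row per group and one column per statistic.

```python
import pandas as pd

# Hypothetical toy data; not the real school file.
demo = pd.DataFrame({
    "Grade": ["3", "3", "4", "4"],
    "Mean Scale Score": [650.0, 670.0, 655.0, 665.0],
})

stats = demo.groupby("Grade")["Mean Scale Score"].describe()
print(stats)
```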

### <span style="color:red">Task 12</span>

Show just the top mean scale score (across all years) for each grade.
(Hint: Similar to getting the sum, use a different aggregation/groupby method)

## Visualizing the Data

### <span style="color:red">Task 13</span>

What is a histogram?  A histogram is used to summarize discrete or continuous data. In other words, it provides a visual interpretation of numerical data by showing the number of data points that fall within a specified range of values (called “bins”). It is similar to a vertical bar graph.

Look up the DataFrame *hist* method that produces a histogram.  Use this method to produce a histogram for the 'Mean Scale Score' for each grade.
(Hint: Use groupby, invoke the hist method, and pass the column you want to use.)
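
To make the idea of bins concrete, the sketch below counts made-up scores per bin with `numpy.histogram`. This is only the counting step; a histogram plot draws one bar per bin with these counts as heights. The scores and bin range are assumptions chosen for readability.

```python
import numpy as np

# Hypothetical scores, chosen so the bin edges are round numbers.
scores = [650, 652, 661, 667, 668, 669, 680]

# Three equal-width bins between 650 and 680.
counts, edges = np.histogram(scores, bins=3, range=(650, 680))

print(edges)   # bin boundaries: 650, 660, 670, 680
print(counts)  # number of scores in each bin: 2, 4, 1
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.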



### <span style="color:red">Task 14</span>

Look up the DataFrame *plot* method.  Use this method to plot the 'Mean Scale Score' on the y-axis and 'District' on the x-axis for the entire dataframe. Use the 'o' style for the plot. The 'o' tells matplotlib to use dots instead of lines, so the plot shows one point per row: the 'Mean Scale Score' plotted against its District.

### <span style="color:red">Task 15</span>

What insights do you draw from the plot above?

## Analyzing the Data

### <span style="color:red">Task 16</span>

Define a function *top* that returns the *n* rows with the highest value for the specified *column*.

*top* should accept a dataframe as its first input, a parameter named *n* that accepts a number and provides a reasonable default, and a parameter called *column* that defaults to 'Mean Scale Score'.

Demonstrate the function against the entire 'schools' dataframe.

(Hint: See panda_dataframe_part5 notebook for an example)
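
One possible shape for such a function, sketched on made-up data. The default of `n=5` and the use of `sort_values` followed by `head` are assumptions; the referenced notebook may take a different approach.

```python
import pandas as pd


def top(df, n=5, column="Mean Scale Score"):
    """Return the n rows of df with the largest values in column."""
    return df.sort_values(by=column, ascending=False).head(n)


# Hypothetical toy data; not the real school file.
demo = pd.DataFrame({
    "District": [1, 2, 3, 4],
    "Mean Scale Score": [650, 690, 670, 660],
})

best_two = top(demo, n=2)
print(best_two)
```

Because the function returns full rows, the caller can see which district produced each score.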

### <span style="color:red">Task 17</span>

Use the *apply* method and your defined *top* function to display the full row for the top score in each grade.

(Hint: use groupby and then apply)
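
A hedged sketch of applying a row-selecting function to each group. The `top` helper is redefined here so the example is self-contained, and the data is made up. Selecting the non-grouping columns before `apply` is a choice made in this sketch so that it behaves the same across pandas versions; the group label then appears in the index of the result.

```python
import pandas as pd


def top(df, n=5, column="Mean Scale Score"):
    """Return the n rows of df with the largest values in column."""
    return df.sort_values(by=column, ascending=False).head(n)


# Hypothetical toy data; not the real school file.
demo = pd.DataFrame({
    "District": [1, 2, 1, 2],
    "Grade": ["3", "3", "4", "4"],
    "Mean Scale Score": [650, 670, 690, 660],
})

# Extra keyword arguments to apply (here n=1) are passed on to top.
best = demo.groupby("Grade")[["District", "Mean Scale Score"]].apply(top, n=1)
print(best)
```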

### <span style="color:red">Task 18</span>

What insight did you get from the previous cell?

### <span style="color:red">Task 19</span>

Extend your selection above to show the row for the top score for each combination of grade and year.
(Hint: extend your groupby)

### <span style="color:red">Task 20</span>

What insight did you get from the previous cell?

### <span style="color:red">Task 21</span>

Is there a trend in the scores of the best district?

Using the DataFrame *plot* method, plot the performance by grade of the top district across time.

(Hint: Use selection, then groupby, then plot.  Do not use a scatter plot for this task.  You want a basic plot that shows the trend over time/years. )

Over time, is the district performance improving, deteriorating, or staying the same for each grade?

### <span style="color:red">Task 22</span>

Define a function *bottom* that returns the *n* rows with the lowest value for the specified *column*.

*bottom* should accept a dataframe as its first input, a parameter named *n* that accepts a number and provides a reasonable default, and a parameter called *column* that defaults to 'Mean Scale Score'.

(Hint: This is similar to top)

Demonstrate the function against the entire 'schools' dataframe.

### <span style="color:red">Task 23</span>

Use the *apply* method and your defined *bottom* function to display the full row for the bottom score in each grade.

### <span style="color:red">Task 24</span>

What insight did you get from the previous cell?

### <span style="color:red">Task 25</span>

Extend your selection above to show the row for the bottom score for each combination of grade and year.

### <span style="color:red">Task 26</span>

What insight did you get from the previous cell?

## Results

The Mayor wants to recognize the top performing districts and direct additional resources to assist lower performing districts.

She asks you to rank the scores by performance as follows.  

For each grade and year, rank each district based upon their 'Mean Scale Score.'  The district with the highest 'Mean Scale Score' should get a rank of 1, the second highest should get a rank of 2, etc.

After ranking for each grade and year, sum each district's ranks over all grades and years.  For the Mayor's purposes, the districts with the lowest total sum of ranks (i.e., the lowest rank numbers overall) are considered the best performing districts.

This task is going to take a bit of work, so let's break the problem into incremental chunks of work.

### <span style="color:red">Task 27</span>

Let's make a smaller dataframe to use while we are working out the larger problem.  

Create a dataframe called 'schools_subset' from 'schools' that includes only Grade 3 for the year 2012.  We are not going to need all the columns, so only add district, grade, year, and mean scale score to the new dataframe.

Display the schools_subset dataframe.

### <span style="color:red">Task 28</span>

Look up the DataFrame *rank* method.  We will use this method to set the ranks.  Since we want the highest score to get a rank of 1, we need to set the 'ascending' parameter of the *rank* call to False.

Make a function called *add_default_rank* that takes a dataframe as the first parameter and a 'column' parameter with the default 'Mean Scale Score' as the second parameter.

This function should create a new column called 'Default Rank' in the passed dataframe.  The value of the new column should be the *rank* for the passed 'column' parameter.

Demonstrate that your *add_default_rank* function works by calling it on the 'schools_subset' dataframe.  Print the 'schools_subset' dataframe before and after the invocation of the function to confirm that (1) the 'Default Rank' column was added and (2) the rank assigned to each row is correct based upon the 'Mean Scale Score' value.
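
A minimal sketch of such a function on made-up data. Returning the dataframe as well as modifying it is an assumption made here so the function can also be chained; the task itself only requires the new column.

```python
import pandas as pd


def add_default_rank(df, column="Mean Scale Score"):
    """Add a 'Default Rank' column in which the highest value of column gets rank 1."""
    df["Default Rank"] = df[column].rank(ascending=False)
    return df


# Hypothetical toy data; not the real school file.
demo = pd.DataFrame({
    "District": [1, 2, 3],
    "Mean Scale Score": [650, 690, 670],
})

print(demo)             # before: no 'Default Rank' column
add_default_rank(demo)
print(demo)             # after: ranks 3.0, 1.0, 2.0
```

Note that `rank` returns floating-point values, because by default tied rows share the average of the ranks they would otherwise occupy.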

### <span style="color:red">Task 29</span>

Now that you have an *add_default_rank* function that works on a dataframe, you can use the *apply* method to apply that function to each group from a groupby.

Group the full 'schools' dataframe by grade and year, then apply *add_default_rank* to the groups.  Store the results of this into a new 'schools2' dataframe.

Display the resulting 'schools2' dataframe.
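
One way to cross-check the per-group ranks is the groupby `rank` method, which ranks each value within its own group. This is a different technique from the *apply* approach the task asks for, shown on made-up data only as a sanity check.

```python
import pandas as pd

# Hypothetical toy data: two districts, one grade, two years.
demo = pd.DataFrame({
    "District": [1, 2, 1, 2],
    "Grade": ["3", "3", "3", "3"],
    "Year": [2011, 2011, 2012, 2012],
    "Mean Scale Score": [650, 670, 680, 660],
})

# Rank within each (Grade, Year) group; the highest score gets rank 1.
demo["Default Rank"] = (
    demo.groupby(["Grade", "Year"])["Mean Scale Score"].rank(ascending=False)
)
print(demo)
```

District 2 ranks first in 2011 and District 1 ranks first in 2012, because each year is ranked separately.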

### <span style="color:red">Task 30</span>

The 'schools2' dataframe now has a rank for each district for each grade and each year.

To fulfill the Mayor's ranking request, we can now produce the ordered list of districts with the top performers at the top of the list.

To do this, group by District, sum the 'Default Rank' column, and sort the result using the *sort_values* method with "ascending" set to True.  Show the results.
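
The final aggregation can be sketched on made-up ranks as follows. The `demo` dataframe stands in for a ranked table and is an assumption, not real output.

```python
import pandas as pd

# Hypothetical ranks for three districts across three grade/year groups.
demo = pd.DataFrame({
    "District": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Default Rank": [1.0, 2.0, 3.0, 1.0, 3.0, 2.0, 2.0, 1.0, 3.0],
})

# Lowest total first, so the best performing district leads the list.
rank_totals = demo.groupby("District")["Default Rank"].sum().sort_values(ascending=True)
print(rank_totals)
```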