![New York City schoolbus](schoolbus.jpg)

Photo by [Jannis Lucas](https://unsplash.com/@jannis_lucas) on [Unsplash](https://unsplash.com).
<br>

Every year, American high school students take the SAT, a standardized test intended to measure literacy, numeracy, and writing skills. There are three sections (reading, math, and writing), each with a **maximum score of 800 points**. The test plays a pivotal role in the college admissions process, which makes it important for students and colleges alike.

Analyzing the performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend. 

You have been provided with a dataset called `schools.csv`, which is previewed below.

You have been tasked with answering three key questions about New York City (NYC) public school SAT performance.

| **Column Name**       | **Description**                                                                                                                                                      |
|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **school_name**       | The name of the public school in New York City. This column identifies the specific institution being analyzed for its SAT performance.                              |
| **borough**           | The borough in which the school is located (e.g., Manhattan, Brooklyn, Queens, The Bronx, Staten Island). This column allows for analysis of SAT performance by geographic area. |
| **building_code**     | A unique identifier for the school building. This code is often used for administrative purposes and may refer to specific physical locations or campuses within the school system. |
| **average_math**      | The average SAT score for students in the math section at the school. This score is out of a maximum of 800 points and indicates the school's overall performance in math.   |
| **average_reading**   | The average SAT score for students in the reading section at the school. Similar to the math score, this score is out of a maximum of 800 points and reflects reading performance.  |
| **average_writing**   | The average SAT score for students in the writing section at the school. This score is also out of 800 points and provides insight into the school's writing proficiency.     |
| **percent_tested**    | The percentage of enrolled students who took the SAT at the school. This metric indicates student participation in the testing process and helps assess the representativeness of the reported scores. |


In [None]:
# Re-run this cell 
import pandas as pd

# Read in the data
schools = pd.read_csv("schools.csv")

# Understand the data
schools.info()

In [11]:
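# Fill missing percent_tested values with the column median
schools['percent_tested'] = schools['percent_tested'].fillna(schools['percent_tested'].median())
# Work on a copy so the original DataFrame keeps its raw score columns
schools_clean = schools.copy()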

**SAT Scores Analysis**: Focuses on the scores in each section: math, reading, and writing.
1. What is the average for each subject area in New York City?
2. What is the distribution of SAT scores (mean, median, and range) across all boroughs for each subject area (math, reading, writing)?
3. How many schools fall into the high-performance category (defined as schools with an average SAT score above 80% of the maximum possible score, i.e., above 640 in each subject)?
4. What is the average percent tested, and how does it correlate with average SAT scores across math, reading, and writing?
5. Are there significant differences in the percent tested among schools categorized by their performance levels (high-performance, low-performance)?

In [None]:
schools_clean['total_scores'] = schools_clean['average_math'] + schools_clean['average_reading']+schools_clean['average_writing']
schools_clean['avg_scores'] = schools_clean['total_scores'] /3 
stat_scores = schools_clean[['average_math', 'average_reading', 'average_writing']].describe()
stat_scores

In [None]:
import matplotlib.pyplot as plt

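# Plot the school-level scores; plotting stat_scores would box-plot the describe() summary rows instead
schools_clean[['average_math', 'average_reading', 'average_writing']].plot(kind='box', figsize=(10, 5))
plt.title('Distribution of average SAT scores in New York schools')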
plt.xticks(ticks =[1,2,3], labels = ['Math', 'Reading', 'Writing'])
plt.ylabel('Scores')
plt.show()

In [None]:
# Schools whose mean across the three subjects exceeds 640 (80% of the 800-point maximum)
high_performance = schools_clean[schools_clean['avg_scores'] > 640]
# Schools above 640 in each individual subject
high_performance_maths = schools_clean[schools_clean['average_math'] > 640]
high_performance_reading = schools_clean[schools_clean['average_reading'] > 640]
high_performance_writing = schools_clean[schools_clean['average_writing'] > 640]
num_high_perf_schools = high_performance.shape[0]
num_high_perf_schools
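
Question 3 defines high performance as scoring above 640 in each subject, whereas the cell above filters on the mean of the three subjects. The following is a minimal sketch of how the two criteria can differ. It uses a small made-up DataFrame (`toy`, with hypothetical schools A to D) rather than `schools.csv`, and the names `high_in_every_subject` and `high_on_average` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data using the same column names as schools.csv
toy = pd.DataFrame({
    "school_name": ["A", "B", "C", "D"],
    "average_math": [700, 650, 645, 500],
    "average_reading": [690, 630, 641, 480],
    "average_writing": [680, 660, 650, 470],
})

subjects = ["average_math", "average_reading", "average_writing"]
threshold = 640  # 80% of the 800-point maximum

# True only when every subject average exceeds the threshold
high_in_every_subject = (toy[subjects] > threshold).all(axis=1)

# True when the mean of the three subject averages exceeds the threshold
high_on_average = toy[subjects].mean(axis=1) > threshold

print(toy.loc[high_in_every_subject, "school_name"].tolist())
print(toy.loc[high_on_average, "school_name"].tolist())
```

In this toy data, school B clears the threshold on average but not in reading, so the per-subject criterion is the stricter of the two.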

In [None]:
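# Mean participation rate across all schools
average_percent_tested = schools_clean['percent_tested'].mean()
print(f"Average percent tested: {average_percent_tested:.2f}")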
plt.figure(figsize=(10, 5))
plt.plot(schools_clean['percent_tested'], schools_clean['avg_scores'], 'o', alpha=0.5)
plt.xlabel('Percentage of students tested')
plt.ylabel('Average SAT scores')
plt.title('Correlation between percentage of students tested and average SAT scores')
plt.show()
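
The scatter plot shows the relationship visually, but question 4 also asks how percent tested correlates with each subject. One way to quantify this is the Pearson correlation coefficient, sketched below with pandas' `DataFrame.corrwith`. The `toy` DataFrame and its values are made up for illustration; with the real data, `schools_clean` would take its place.

```python
import pandas as pd

# Hypothetical toy data using the same column names as schools.csv
toy = pd.DataFrame({
    "percent_tested": [40.0, 55.0, 70.0, 85.0, 95.0],
    "average_math": [400, 430, 480, 520, 600],
    "average_reading": [410, 420, 470, 510, 590],
    "average_writing": [390, 425, 460, 505, 585],
})

subjects = ["average_math", "average_reading", "average_writing"]

# Pearson correlation (the pandas default) of each subject with percent_tested
correlations = toy[subjects].corrwith(toy["percent_tested"])
print(correlations.round(2))
```

A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. Correlation alone does not show that higher participation causes higher scores.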



In [None]:
schools_clean['high_performing'] = schools_clean['avg_scores'] > 640
schools_clean = schools_clean.sort_values(by = 'avg_scores', ascending = False)
schools_clean_groups = schools_clean.groupby('high_performing')['percent_tested'].mean()
schools_clean_groups.plot(kind='bar', figsize=(10, 5))
plt.xlabel('')
plt.xticks(ticks=[0, 1], labels=['Low performing schools', 'High performing schools'], rotation=0)
plt.title('Comparison of percentage of students tested in high and low performing schools')
plt.ylabel('Percentage of students tested')
plt.show()
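
The bar chart compares group means, but it does not say whether the difference is statistically significant. One possible approach, not used in the notebook itself, is a permutation test on the difference in mean `percent_tested`. The sketch below assumes a made-up `toy` DataFrame and an arbitrary choice of 5000 permutations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: percent tested and a high-performing flag per school
toy = pd.DataFrame({
    "percent_tested": [95.0, 90.0, 88.0, 60.0, 55.0, 70.0, 65.0, 50.0],
    "high_performing": [True, True, True, False, False, False, False, False],
})

high = toy.loc[toy["high_performing"], "percent_tested"].to_numpy()
low = toy.loc[~toy["high_performing"], "percent_tested"].to_numpy()
observed_diff = high.mean() - low.mean()

# Shuffle the pooled values and recompute the difference in means each time
rng = np.random.default_rng(0)
pooled = np.concatenate([high, low])
n_permutations = 5000  # arbitrary choice for this sketch
perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:len(high)].mean() - shuffled[len(high):].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.1f}, p-value: {p_value:.3f}")
```

A small p-value suggests the gap between the groups is unlikely to arise from random group assignment alone.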

1. Which borough has the highest average SAT scores in each subject area?
2. What are the top-performing and lowest-performing schools in each borough based on SAT scores?
3. How does the variability of SAT scores (standard deviation) differ across boroughs for each subject area?
4. Is there a correlation between the percent tested and average SAT scores at the borough level, and does this vary across boroughs?
5. How do schools categorized as high-performing compare to those categorized as low-performing in terms of the average percent tested in each borough?

In [None]:
scores_borough = schools_clean.groupby('borough')[['average_math', 'average_reading', 'average_writing', 'avg_scores', 'percent_tested']].mean()
scores_borough = scores_borough.sort_values(by = 'avg_scores', ascending=False)
scores_borough  = scores_borough.round(2)
scores_borough
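
The cell above covers the borough averages. Questions 2 and 3 in this list (top and lowest schools per borough, and score variability per borough) could be approached along the following lines. This is a sketch on a made-up `toy` DataFrame with hypothetical schools A to E; the names `std_by_borough`, `top_schools`, and `bottom_schools` are illustrative.

```python
import pandas as pd

# Hypothetical toy data using the same column names as schools.csv
toy = pd.DataFrame({
    "school_name": ["A", "B", "C", "D", "E"],
    "borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "average_math": [400, 500, 450, 600, 510],
    "average_reading": [410, 490, 440, 590, 500],
    "average_writing": [390, 480, 445, 580, 505],
})

subjects = ["average_math", "average_reading", "average_writing"]
toy["avg_scores"] = toy[subjects].mean(axis=1)

# Standard deviation of each subject within each borough
std_by_borough = toy.groupby("borough")[subjects].std().round(2)

# Row labels of the highest- and lowest-scoring school in each borough
top_idx = toy.groupby("borough")["avg_scores"].idxmax()
bottom_idx = toy.groupby("borough")["avg_scores"].idxmin()

columns = ["borough", "school_name", "avg_scores"]
top_schools = toy.loc[top_idx, columns]
bottom_schools = toy.loc[bottom_idx, columns]

print(std_by_borough)
print(top_schools)
print(bottom_schools)
```

`idxmax` and `idxmin` return one row label per borough, so ties are resolved by taking the first matching school.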

1. Which NYC schools have the best math results? 
2. What are the top 10 performing schools based on the combined SAT scores? 
3. Which single borough has the largest standard deviation in the combined SAT score?
- Save your results as a pandas DataFrame called `largest_std_dev`.
- The DataFrame should contain one row, with:
  - **borough** - the name of the NYC borough with the largest standard deviation of "total_SAT".
  - **num_schools** - the number of schools in the borough.
  - **average_SAT** - the mean of "total_SAT".
  - **std_SAT** - the standard deviation of "total_SAT".

In [None]:
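# Schools with an average math score above 640 (80% of 800), best first
best_math_schools = schools[schools['average_math'] > 640].sort_values('average_math', ascending=False)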


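# Combined SAT score per school
top_performing_schools = schools.copy()
top_performing_schools['total_SAT'] = top_performing_schools['average_math'] + top_performing_schools['average_reading'] + top_performing_schools['average_writing']

# Top 10 schools by combined SAT score
top_10_schools = top_performing_schools.sort_values('total_SAT', ascending=False)[['school_name', 'total_SAT']].head(10)

# Per-borough school count, mean, and standard deviation of total_SAT
borough_stats = top_performing_schools.groupby('borough').agg({'school_name':'count', 'total_SAT':['mean', 'std']})
borough_stats.columns = ['num_schools', 'average_SAT', 'std_SAT']

# Keep only the borough with the largest standard deviation, with borough as a column
largest_std_dev = borough_stats.nlargest(1, 'std_SAT').reset_index()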
