# Welcome to my notebook!
### If you think this notebook is heading somewhere and could be meaningful to others, please upvote it! I would like many people to look at it and share their points of view. You don't get better just by being told you are doing a good job, but through suggestions and criticism.
### Thank you!

## Table of contents

- <a href='#introduction'>1 <b>Introduction:</b></a>
- <a href='#importing'>2 <b>Importing and installing dependencies:</b></a>
- <a href='#overview'>3 <b>Overview of dataset:</b></a>
    - <a href='#columns'>3.1 Columns</a>
    - <a href='#missingValues'>3.2 Missing Values</a>
- <a href='#schoolExplorer'>4 <b>School Explorer:</b></a>
    - <a href='#schoolRegions'>4.1 School Regions</a>
- <a href='#brooklyn'>5 <b>Brooklyn:</b></a>
    - <a href='#brooklynOverview'>5.1 Brooklyn overview:</a>
- <a href='#bronx'>6 <b>Bronx:</b></a>
- <a href='#newYork'>7 <b>New York:</b></a>
- <a href='#statenIsland'>8 <b>Staten Island:</b></a>

## <a id='introduction'>1 <b>Introduction:</b></a>

PASSNYC is a not-for-profit organization dedicated to broadening educational opportunities for New York City's talented and underserved students. New York City is home to some of the most impressive educational institutions in the world, yet in recent years, the City’s specialized high schools - institutions with historically transformative impact on student outcomes - have seen a shift toward more homogeneous student body demographics.

PASSNYC uses public data to identify students within New York City’s under-performing school districts and, through consulting and collaboration with partners, aims to increase the diversity of students taking the Specialized High School Admissions Test (SHSAT). By focusing efforts in under-performing areas that are historically underrepresented in SHSAT registration, we will help pave the path to specialized high schools for a more diverse group of students.

## <a id='importing'>2 <b>Importing and installing dependencies:</b></a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error, make_scorer
from scipy.stats import skew
from IPython.display import display
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Limit float output to 3 decimal places:
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) 

# Optional display settings for wide DataFrames. (When I first tried these, pandas only accepted them after I had saved the data in a variable. Let me know if there is a way around this.)
# pd.set_option('display.height', 1000)
# pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', 500)
# pd.set_option('display.width', 1000)

print('Dependencies imported!')
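
On the display options question above: as far as I can tell, `pd.set_option` changes global pandas settings and does not need a DataFrame to exist first. Here is a minimal sketch, assuming a recent pandas version. It leaves out `display.height`, which I believe is no longer a valid option in recent releases and may have been the real cause of the error.

```python
import pandas as pd

# Options are global pandas settings, so they can be set before any data is loaded.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Temporary override: the previous value is restored when the block exits.
with pd.option_context('display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')

rows_after = pd.get_option('display.max_rows')
print(rows_inside, rows_after)
```

`pd.option_context` is handy when only one cell needs a different setting.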

## <a id='overview'>3 <b>Overview of dataset:</b></a>

In [None]:
school_explorer = pd.read_csv('../input/data-science-for-good/2016 School Explorer.csv')
registration_testers = pd.read_csv('../input/data-science-for-good/D5 SHSAT Registrations and Testers.csv')

In [None]:
school_explorer.head()

In [None]:
registration_testers.head()

In [None]:
print(school_explorer.shape)
school_explorer.describe()

In [None]:
print(registration_testers.shape)
registration_testers.describe()

### <a id='columns'>3.1 <b>Columns:</b></a>
A preliminary overview of school_explorer:
- Separate the different schools by region (Manhattan, Brooklyn, Bronx, Queens, Staten Island and many more).
    - Which regions perform best?
    - Does the economic wealth of a region help performance?
    - Where are the majority of schools? Is there a reason for this?
- Grades provided by schools (PK, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12)
    - Which grade is the most popular?
    - Why are some grades not provided by some schools?
- Community school or not.
    - How does performance compare between community and non-community schools?
- Estimated income of school.
    - Does income affect the performance of the school and students?
- The ethnic background of students.
    - Is there a relationship between ethnic background and performance?
- Performance of teachers, leadership, environment, family-community.
    - Does the staff have a big impact on the performance of the students?
- Trust of school by peers.
    - Does trust have a big impact on the performance of the students?
- Performance of students
    - ELA Proficiency.
    - MATH Proficiency.
- Grades 3 to 8. 
    - Why only these 6 grades?

In [None]:
school_explorer.columns.tolist()
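
To make the "which regions perform best?" question concrete, here is a rough sketch of a per-region comparison. It runs on a small made-up table rather than the real data, and the column names (`City`, `Average ELA Proficiency`, `Average Math Proficiency`) are assumptions that should be checked against the column list printed above.

```python
import pandas as pd

# Toy stand-in for school_explorer; values and column names are illustrative assumptions.
schools = pd.DataFrame({
    'City': ['BROOKLYN', 'BROOKLYN', 'BRONX', 'BRONX', 'NEW YORK'],
    'Average ELA Proficiency': [2.5, 2.7, 2.2, 2.4, 3.0],
    'Average Math Proficiency': [2.6, 3.0, 2.3, 2.5, 3.2],
})

# Mean proficiency per region, highest math average first.
by_region = (
    schools.groupby('City')[['Average ELA Proficiency', 'Average Math Proficiency']]
    .mean()
    .sort_values('Average Math Proficiency', ascending=False)
)
print(by_region)
```

The same `groupby` pattern should work for the community-school and income questions by swapping the grouping column.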

A preliminary overview of registration_testers:
- Year of SHST
- Grade level
- Enrollment
- Total number of students registered for the SHSAT
- Total number of students who took the SHSAT

In [None]:
registration_testers.columns.tolist()
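
Enrollment, registrations and test takers are easier to compare as rates. The sketch below uses made-up numbers and short hypothetical column names (`enrollment`, `registered`, `took_test`); the real columns printed above would need to be mapped onto them.

```python
import pandas as pd

# Toy stand-in for registration_testers; names and numbers are hypothetical.
testers = pd.DataFrame({
    'enrollment': [100, 80, 0],
    'registered': [40, 20, 0],
    'took_test': [30, 10, 0],
})

# Mask zero denominators so the division yields NaN instead of inf.
enrollment = testers['enrollment'].where(testers['enrollment'] > 0)
registered = testers['registered'].where(testers['registered'] > 0)

# Share of enrolled students who registered, and share of registrants who sat the test.
testers['registration_rate'] = testers['registered'] / enrollment
testers['show_up_rate'] = testers['took_test'] / registered
print(testers)
```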

### <a id='missingValues'>3.2 <b>Missing Values:</b></a>
- school_explorer has some missing values:
    - Columns missing the majority of their values:
        - Other Location Code in LCGMS, Adjusted Grade, New?
        - Do we need these columns?
    - Columns missing a small or medium share of their values:
        - School Income Estimate is missing 31% of the data. This could undermine any conclusion about whether school funding affects overall performance.
        - Most of the other columns with missing values are missing only 1% to 6%, which is not very significant. We might be able to fill those values with averages or use them as they are.

In [None]:
total = school_explorer.isnull().sum().sort_values(ascending=False)
percent = ((school_explorer.isnull().sum() / school_explorer.isnull().count()) * 100).sort_values(ascending=False)
missing_values = pd.DataFrame({'Total ': total, 'Missing ratio ': percent})
missing_values.head(25)
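
One possible way to handle the two cases above, sketched on a toy table: fill sparsely missing columns with the column mean, and flag the heavily missing one instead of filling it. The column names `income_estimate` and `ela_score` are placeholders, not names from the dataset, and mean imputation is only one option among several.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names are placeholders for illustration.
scores = pd.DataFrame({
    'income_estimate': [30000.0, np.nan, 50000.0, np.nan],
    'ela_score': [2.0, 3.0, np.nan, 4.0],
})

filled = scores.copy()

# Sparse gaps: fill with the column mean.
filled['ela_score'] = filled['ela_score'].fillna(filled['ela_score'].mean())

# Large gaps: keep the NaN values but record which rows were observed.
filled['income_missing'] = filled['income_estimate'].isnull()
print(filled)
```

Keeping a flag column makes it possible to check later whether schools without an income estimate differ from the rest.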

registration_testers has no missing values.

In [None]:
total = registration_testers.isnull().sum().sort_values(ascending=False)
percent = ((registration_testers.isnull().sum() / registration_testers.isnull().count()) * 100).sort_values(ascending=False)
missing_values = pd.DataFrame({'Total ': total, 'Missing ratio ': percent})
missing_values.head(10)
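
Since the same missing-value summary is computed for both tables, it could be wrapped in a small helper. This is a sketch; `missing_report` is a name made up here, and `isnull().mean()` is used as a shortcut for the count ratio.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the count and percentage of missing values per column, worst first."""
    total = df.isnull().sum()
    percent = df.isnull().mean() * 100  # mean of a boolean mask = share of True values
    report = pd.DataFrame({'total': total, 'missing_percent': percent})
    return report.sort_values('total', ascending=False)

# Small demonstration table.
demo = pd.DataFrame({'a': [1, np.nan, np.nan, 4], 'b': [1, 2, 3, 4]})
report = missing_report(demo)
print(report)
```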

## <a id='schoolExplorer'>4 <b>School Explorer:</b></a>

### <a id='schoolRegions'>4.1 <b>School Regions</b></a>
Surprisingly, Brooklyn seems to have the most schools, followed by the Bronx, New York and Staten Island. After that, the remaining cities are distributed fairly evenly.
This raises the question of why Brooklyn has more schools than the Bronx or New York. Let's analyze the top 4 cities, since they differ the most in the dataset.

In [None]:
total = school_explorer['City'].value_counts().reset_index()
total.columns = ['city', 'total']

plt.figure(figsize=(20, 10))

barplot = sns.barplot(x=total['total'], y=total['city'])
barplot.set(xlabel='', ylabel='')
plt.title('Number of schools in each city', fontsize=20)
plt.xticks(fontsize=17)
plt.yticks(fontsize=10)
plt.show()
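
To narrow the analysis to the top 4 cities, something like the following should work. It is sketched on made-up counts, and only the `City` column name comes from the notebook.

```python
import pandas as pd

# Toy data: one row per school; the counts per city are invented.
schools = pd.DataFrame({
    'City': ['BROOKLYN'] * 5 + ['BRONX'] * 4 + ['NEW YORK'] * 3
            + ['STATEN ISLAND'] * 2 + ['JAMAICA'],
})

# value_counts() sorts by frequency, so head(4) gives the four largest cities.
top_cities = schools['City'].value_counts().head(4).index.tolist()

# Keep only the schools located in those cities.
top_schools = schools[schools['City'].isin(top_cities)]
print(top_cities, len(top_schools))
```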

## <a id='brooklyn'>5 <b>Brooklyn:</b></a>

### <a id='brooklynOverview'>5.1 Brooklyn overview:</a>
A preliminary overview of brooklyn_census:
- The population of Brooklyn is above 2.5 million.
- Estimated population increase of 5.8%.
- 22.9% of the population is under the age of 18.
    - We can analyze whether a good percentage of them attend school.
- Ethnicity:
    - 49% of the population is white.
    - 34.3% is Black or African American.
- Median income: $50,640
- Poverty level: 20.6%

In [None]:
# Census downloaded from: https://www.census.gov/quickfacts/fact/table/kingscountybrooklynboroughnewyork/IPE120216
brooklyn_census = pd.read_csv('../input/brooklyn-census/QuickFacts Jun-27-2018 (1).csv') 

# Drop the two note columns (second and last), which do not contain any values.
brooklyn_census = brooklyn_census.drop(
    columns=['Fact Note', 'Value Note for Kings County (Brooklyn Borough), New York']
)

# Renaming the columns for easier use ('Kings County (Brooklyn Borough), New York' becomes 'total'):
brooklyn_census.columns = ['fact', 'total']
brooklyn_census.head(60)
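
To use the census figures in calculations, the text values need to become numbers. The helper below is a sketch that assumes the values are strings formatted like the ones in the overview above (`$50,640`, `22.9%`). `parse_quickfact` is a made-up name, and the 2,500,000 population is only the rounded lower bound from the overview, not the exact census figure.

```python
def parse_quickfact(value):
    """Convert strings such as '$50,640' or '22.9%' to floats; return None if not numeric."""
    text = str(value).strip().replace(',', '').replace('$', '')
    is_percent = text.endswith('%')
    try:
        number = float(text.rstrip('%'))
    except ValueError:
        return None
    # Percentages are returned as fractions (22.9% becomes roughly 0.229).
    return number / 100 if is_percent else number

median_income = parse_quickfact('$50,640')
under_18_share = parse_quickfact('22.9%')

# Rough estimate of residents under 18, using the assumed lower-bound population.
assumed_population = 2_500_000
under_18_estimate = round(assumed_population * under_18_share)
print(median_income, under_18_estimate)
```

Comparing an estimate like this against total enrollment in Brooklyn schools would be one way to approach the attendance question.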

In [None]:
# Saving only Brooklyn schools in a variable:
brooklyn = school_explorer[school_explorer['City'] == 'BROOKLYN'].copy()
print(brooklyn.shape)
brooklyn.head()
