# *Data Cleaning Walkthrough*

**For this project, we'll use data about New York City public schools, which can be found <a href="https://data.cityofnewyork.us/browse?category=Education">here</a>.<br> Below are the datasets we'll use:**

<a href="https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4">SAT scores by school</a> - SAT scores for each high school in New York City<br>
<a href="https://data.cityofnewyork.us/Education/School-Attendance-and-Enrollment-Statistics-by-Dis/7z8d-msnt">School attendance</a> - Attendance information for each school in New York City<br>
<a href="https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3">Class size</a> - Information on class size for each school<br>
<a href="https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e">AP test results</a> - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)<br>
<a href="https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a">Graduation outcomes</a> - The percentage of students who graduated, and other outcome information<br>
<a href="https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j">Demographics</a> - Demographic information for each school<br>
<a href="https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8">School survey</a> - Surveys of parents, teachers, and students at each school<br>

**All of these data sets are interrelated. We'll need to combine them into a single data set before we can find correlations.**


### Reading the Data

1. Read each of the files in the list data_files into a pandas dataframe using the pandas.read_csv() function.
    - Recall that all of the data sets are in the schools folder. That means the path to ap_2010.csv is schools/ap_2010.csv.
2. Add each of the dataframes to the dictionary data, using the base of the filename as the key. For example, you'd enter ap_2010 for the file ap_2010.csv.
3. Afterwards, data should have the following keys:
    - ap_2010
    - class_size
    - demographics
    - graduation
    - hs_directory
    - sat_results
4. In addition, each key in data should have the corresponding dataframe as its value.
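
The steps above can be sketched on toy files as follows. This is a minimal illustration, not the project's data: the temporary folder, the file contents, and the name `toy_data` are all made up, and `os.path.splitext` is used here as one way to get the base of a filename without assuming a fixed extension length.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the schools folder; file names reuse two of the
# project's names, but the contents are invented for illustration.
with tempfile.TemporaryDirectory() as folder:
    for name in ["ap_2010.csv", "sat_results.csv"]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("DBN,value\n01M448,1\n01M450,2\n")

    toy_data = {}
    for name in sorted(os.listdir(folder)):
        # splitext("ap_2010.csv") returns ("ap_2010", ".csv")
        key, _ext = os.path.splitext(name)
        toy_data[key] = pd.read_csv(os.path.join(folder, name))

print(sorted(toy_data))
```

Each key is the filename without its extension, and each value is the dataframe read from that file.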

In [1]:
import pandas as pd
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]
data = {}

# Key each dataframe by its filename without the ".csv" extension.
for item in data_files:
    data[item[:-4]] = pd.read_csv("schools/%s" % item)

data

{'ap_2010':         DBN                                         SchoolName  \
 0    01M448                       UNIVERSITY NEIGHBORHOOD H.S.   
 1    01M450                             EAST SIDE COMMUNITY HS   
 2    01M515                                LOWER EASTSIDE PREP   
 3    01M539                     NEW EXPLORATIONS SCI,TECH,MATH   
 4    02M296              High School of Hospitality Management   
 5    02M298                                   Pace High School   
 6    02M300  Urban Assembly School of Design and Construction,   
 7    02M303                         Facing History School, The   
 8    02M305  Urban Assembly Academy of Government and Law, The   
 9    02M308                       Lower Manhattan Arts Academy   
 10   02M400                       HS FOR ENVIRONMENTAL STUDIES   
 11   02M408                       PROFESSIONAL PERFORMING ARTS   
 12   02M411                           BARUCH COLLEGE CAMPUS HS   
 13   02M412                       NYC LAB HS FOR C

### Exploring the SAT Data 
1. Display the first five rows of the SAT scores data.
    - Use the key sat_results to access the SAT scores dataframe stored in the dictionary data.
    - Use the pandas.DataFrame.head() method along with the print() function to display the first five rows of the dataframe.

In [2]:
data['sat_results'][:5]

| | DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score |
|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377 | 402 | 370 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 |
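
As a side note, `pandas.DataFrame.head()` returns the first five rows by default, which are the same rows a positional slice `[:5]` selects. A small sketch on an invented toy frame (the name `toy` and its values are assumptions, not part of the SAT data):

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "DBN": [f"01M{n:03d}" for n in range(8)],
    "score": list(range(8)),
})

first_five = toy.head()   # head() defaults to n=5
sliced = toy[:5]          # positional row slice

print(first_five.equals(sliced))
```

Either form works for a quick look at the data; `head()` states the intent more explicitly.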


### Exploring the Remaining Data
1. Loop through each key in data. For each key:
    - Display the first five rows of the dataframe associated with the key.

In [3]:
for item in data:
    print('\n%s\n' % item)
    print(data[item][:5])
    


ap_2010

      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  

class_size

   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente     0K

### Reading in the Survey Data

1. Read in survey_all.txt.
    - Use the pandas.read_csv() function to read survey_all.txt into the variable all_survey. Recall that this file is located in the schools folder.
        - Specify the keyword argument delimiter="\t".
        - Specify the keyword argument encoding="windows-1252".
2. Read in survey_d75.txt.
    - Use the pandas.read_csv() function to read survey_d75.txt into the variable d75_survey. This file is also located in the schools folder.
        - Specify the keyword argument delimiter="\t".
        - Specify the keyword argument encoding="windows-1252".
3. Combine d75_survey and all_survey into a single dataframe.
    - Use the pandas.concat() function with the keyword argument axis=0 to combine all_survey and d75_survey into the dataframe survey.
    - Pass in all_survey first, then d75_survey when calling the pandas.concat() function.
4. Display the first five rows of survey using the pandas.DataFrame.head() method.

In [16]:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding="windows-1252")
survey = pd.concat([all_survey, d75_survey], axis=0)
survey.head()

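| | N_p | N_s | N_t | aca_p_11 | aca_s_11 | aca_t_11 | aca_tot_11 | bn | com_p_11 | com_s_11 | ... | t_q8c_1 | t_q8c_2 | t_q8c_3 | t_q8c_4 | t_q9 | t_q9_1 | t_q9_2 | t_q9_3 | t_q9_4 | t_q9_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90.0 | | 22.0 | 7.8 | | 7.9 | 7.9 | M015 | 7.6 | | ... | 29.0 | 67.0 | 5.0 | 0.0 | | 5.0 | 14.0 | 52.0 | 24.0 | 5.0 |
| 1 | 161.0 | | 34.0 | 7.8 | | 9.1 | 8.4 | M019 | 7.6 | | ... | 74.0 | 21.0 | 6.0 | 0.0 | | 3.0 | 6.0 | 3.0 | 78.0 | 9.0 |
| 2 | 367.0 | | 42.0 | 8.6 | | 7.5 | 8.0 | M020 | 8.3 | | ... | 33.0 | 35.0 | 20.0 | 13.0 | | 3.0 | 5.0 | 16.0 | 70.0 | 5.0 |
| 3 | 151.0 | 145.0 | 29.0 | 8.5 | 7.4 | 7.8 | 7.9 | M034 | 8.2 | 5.9 | ... | 21.0 | 45.0 | 28.0 | 7.0 | | 0.0 | 18.0 | 32.0 | 39.0 | 11.0 |
| 4 | 90.0 | | 23.0 | 7.9 | | 8.1 | 8.0 | M063 | 7.9 | | ... | 59.0 | 36.0 | 5.0 | 0.0 | | 10.0 | 5.0 | 10.0 | 60.0 | 15.0 |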
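
To see what the row-wise concatenation does, here is a minimal sketch on two invented frames. The names `toy_all`, `toy_d75`, and `toy_survey` and all of their values are assumptions for illustration; only `pd.concat` with `axis=0` comes from the steps above.

```python
import pandas as pd

# Toy stand-ins for all_survey and d75_survey; the values are invented.
toy_all = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "N_p": [90.0, 161.0],
    "rr_p": [40, 55],
})
toy_d75 = pd.DataFrame({"dbn": ["75K004"], "N_p": [30.0]})

# axis=0 stacks the rows; columns are aligned by name, and a column
# missing from one frame is filled with NaN for that frame's rows.
toy_survey = pd.concat([toy_all, toy_d75], axis=0)

print(toy_survey.shape)
print(toy_survey.index.tolist())
```

Note that each frame keeps its own row labels, so the combined index repeats (here 0, 1, 0). Passing `ignore_index=True` to `pd.concat` would renumber the rows if that matters for later steps.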


**Here, we'll need to filter the columns to remove the ones we don't need. Luckily, there's a <a href="https://data.cityofnewyork.us/api/views/mnz3-dyi8/files/aa68d821-4dbb-4eb2-9448-3d8cbbad5044?download=true&filename=Survey%20Data%20Dictionary.xls">data dictionary</a> at the original data download <a href="https://data.cityofnewyork.us/Education/2010-2011-NYC-School-Survey/mnz3-dyi8">location</a>. The dictionary tells us what each column represents. Based on our knowledge of the problem and the analysis we're trying to do, we can use the data dictionary to determine which columns to use.<br><br>Based on the dictionary, it looks like these are the relevant columns:**

***[*** *"dbn", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"* ***]***

### Cleaning up the Surveys


1. Copy the data from the dbn column of survey into a new column in survey called DBN.
2. Filter survey so it only contains the columns we listed above. You can do this using pandas.DataFrame.loc[].
    - Remember that we copied dbn into the new column DBN; be sure to use DBN instead of dbn in the list of columns we want to keep.
3. Assign the dataframe survey to the key survey in the dictionary data.
4. When you're finished, the value in data["survey"] should be a dataframe with 23 columns and 1702 rows.
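
The steps above can be sketched as follows on a toy frame. This is an illustration under stated assumptions, not the walkthrough's solution: `toy_survey`, `toy_data`, the values, and the shortened `keep` list are all invented, and the real list would hold the 23 columns named earlier.

```python
import pandas as pd

# Toy survey frame with invented values and one column we do not want.
toy_survey = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_s": [0.0, 85.0],
    "saf_p_11": [8.5, 8.4],
    "t_q9": [None, None],
})

# Step 1: copy dbn into a new DBN column.
toy_survey["DBN"] = toy_survey["dbn"]

# Step 2: keep only the wanted columns (a shortened, illustrative list
# that uses DBN in place of dbn).
keep = ["DBN", "rr_s", "saf_p_11"]
toy_survey = toy_survey.loc[:, keep]

# Step 3: store the filtered frame under the "survey" key.
toy_data = {"survey": toy_survey}

print(toy_data["survey"].columns.tolist())
```

Checking `data["survey"].shape` afterwards is a quick way to confirm the expected number of rows and columns.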