## Data cleaning lesson/exercise from [DataQuest](https://www.dataquest.io/m/136/data-cleaning-walkthrough)

Read the CSV datasets into pandas DataFrame objects and store them in a dictionary keyed by dataset name.

In [1]:
import pandas as pd

# NYC high school datasets we'll use, from NYC Open Data (https://data.cityofnewyork.us/Education)
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

# map each file name, minus its .csv extension (key), to the corresponding pandas DataFrame (value)
data = {}
for data_file in data_files:
    dataset_name = data_file.split(".")[0]
    data[dataset_name] = pd.read_csv(data_file)
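
The loop above derives each key with `split(".")[0]`, which works for these six file names. As a minimal sketch of the same idea, the variant below uses `pathlib.Path.stem` so that a file name containing extra dots would still keep its full base name. The helper `load_datasets` and the throwaway demo files are assumptions made for illustration; they are not part of the original lesson.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_datasets(paths):
    """Map each file's base name (extension removed) to its DataFrame."""
    return {Path(p).stem: pd.read_csv(p) for p in paths}


# Demo on throwaway files so the sketch runs without the NYC data.
# The DBN values below are placeholders, not real records.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sat_results", "class_size"):
        toy = pd.DataFrame({"DBN": ["01M292", "01M448"]})
        toy.to_csv(Path(tmp) / f"{name}.csv", index=False)
    demo = load_datasets(sorted(Path(tmp).glob("*.csv")))

# Print each dataset name with its (rows, columns) shape as a quick sanity check.
for name, frame in demo.items():
    print(name, frame.shape)
```

Printing the shapes right after loading is a cheap way to confirm that every file was found and parsed before any cleaning starts.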

Read the survey datasets, which are not comma-separated files but tab-delimited text files in the "windows-1252" encoding:

In [3]:
# read the survey datasets into Pandas DataFrame objects
all_survey = pd.read_csv("survey_all.txt", 
                         delimiter="\t", 
                         encoding="windows-1252")
d75_survey = pd.read_csv("survey_d75.txt", 
                         delimiter="\t", 
                         encoding="windows-1252")

# combine the two surveys into a single DataFrame object
survey = pd.concat((all_survey, d75_survey), axis=0, sort=True)
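
To make the effect of `axis=0` and `sort=True` concrete, here is a small sketch on two toy frames. The values and the `d75` column are made up for illustration and do not come from the survey files.

```python
import pandas as pd

# Two toy frames that share "dbn" but otherwise have different columns.
a = pd.DataFrame({"dbn": ["01M015"], "rr_s": [89.0]})
b = pd.DataFrame({"dbn": ["75K004"], "d75": [1]})

# axis=0 stacks the rows; the result has the union of the columns.
# sort=True orders that union alphabetically when the columns differ.
combined = pd.concat((a, b), axis=0, sort=True)

print(combined.columns.tolist())
print(combined)
```

Cells that a frame had no column for are filled with `NaN`, and each frame keeps its own row labels, so the combined index repeats `0` here. Passing `ignore_index=True` would renumber the rows if unique labels are needed.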

Clean the survey DataFrame by:

1. adding a "DBN" column that copies the "dbn" field, and
2. keeping only the fields we'll need for our analysis

In [4]:
# copy the "dbn" column as a column named "DBN"
survey["DBN"] = survey["dbn"]

# list the fields we want to keep in our "survey" DataFrame
survey_fields = [
    "DBN",
    "rr_s", "rr_t", "rr_p",
    "N_s", "N_t", "N_p",
    "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11",
    "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11",
    "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11",
    "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11",
]

# filter the DataFrame, assign it into the data dictionary
data["survey"] = survey[survey_fields]
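
The cell above copies "dbn" into "DBN" and then selects the wanted columns, which drops the lowercase original. An alternative sketch, shown on a toy frame with made-up values, renames the column in one step with `DataFrame.rename`, checks that every requested field exists, and takes an explicit `.copy()` so later edits do not touch the source frame. The `extra` column and the shortened field list are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the combined survey frame.
survey = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_s": [89.0, 84.0],
    "extra": [1, 2],  # a column we do not need
})
survey_fields = ["DBN", "rr_s"]  # shortened list for the demo

# Rename instead of copying, so only one identifier column remains.
renamed = survey.rename(columns={"dbn": "DBN"})

# Fail early with a clear message if a requested field is absent.
missing = [field for field in survey_fields if field not in renamed.columns]
if missing:
    raise KeyError(f"survey is missing fields: {missing}")

# Select the wanted columns as an independent copy.
cleaned = renamed[survey_fields].copy()
print(cleaned)
```

Either approach ends with the same columns; the rename simply avoids carrying a duplicate column in the intermediate frame.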