# Week 2 - Data Prep 1
After this week's lesson you should be able to:
- Check a column's data type and convert it to another type
- Rename a dataframe column 
- Handle missing data: 
    - Filter out missing data
    - Replace values 

This week's lessons are adapted from:
- [PPD599: Advanced Urban Analytics](https://github.com/gboeing/ppd599/tree/main/syllabus)
- [Geo-Python Lesson 5](https://geo-python-site.readthedocs.io/en/latest/notebooks/L5/processing-data-with-pandas.html)

In [None]:
# We are going to start importing the libraries we need
# all in one cell. 
# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.

import pandas as pd

# If you don't have numpy installed, you can install it via pip
# !pip install numpy in a code cell
import numpy as np

# 1. Data cleaning
As you might have already seen, when we work with data, the initial dataset is not always in a shape where we can use it as is. 

Sometimes column names are misspelled or unclear, there may be missing values, or the format of a column is incorrect. Moreover, you may also have noticed that we can often extract information from columns to make them easier to work with. All these steps can be considered part of a data cleaning or data wrangling process, where we get the dataset ready to be used more effectively for our analysis purposes. 
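
As a small illustration of fixing unclear column names, here is a minimal sketch using `rename()`. The toy DataFrame and its column names are made up for this example and are not taken from this week's datasets.

```python
import pandas as pd

# Toy data with hypothetical, hard-to-read column names
toy = pd.DataFrame({"Schl Nm": ["A", "B"], "enrol": [120, 95]})

# rename() returns a new DataFrame; keys are the old names, values are the new names
toy = toy.rename(columns={"Schl Nm": "school_name", "enrol": "enrollment"})

print(toy.columns.tolist())
```

Columns that are not listed in the dictionary keep their existing names.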




## 1.1 Getting the data
Let's say we want to compare the relationship between: 
1. the **total number of students in a general ed public school** 
2. the **money spent on new school construction and improvements in that school**. 

### School Construction Authority

First, make sure `Active_Projects_Under_Construction.csv` is in the same folder as this notebook. It comes from [Active Projects Under Construction](https://data.cityofnewyork.us/Housing-Development/Active-Projects-Under-Construction/8586-3zfm) on NYC's open data portal, but I've modified it a little.

This is a dataset of new school projects (Capacity) and Capital Improvement Projects (CIP) currently under construction, created by the School Construction Authority. 




In [None]:
## Here we are going to read a csv file saved in the same folder as this notebook
## We are going to use the read_csv() function from the pandas library

projects_under_const = pd.read_csv('Active_Projects_Under_Construction.csv')

Also, go ahead and download the data dictionary `SCA Active Projects in Construction Data Dictionary.xlsx`. Data dictionaries often have explanations for what each column name represents and other useful information about the data. 


If you open up the data dictionary, does it correspond to the "Columns in this Dataset" section in the NYC OpenData's page on this dataset? No, right? We have to be careful about these inconsistencies, even in official portals.

Taking a look at the first five rows we can already see there is a lot of missing data in this dataset. 

In [None]:
projects_under_const.head()
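
Looking at the first five rows only gives an impression. A quick way to count missing values per column is to chain `isna()` and `sum()`. The sketch below uses a made-up toy DataFrame (only the `data_year` column name comes from the dataset above), so the numbers are purely illustrative.

```python
import pandas as pd
import numpy as np

# Toy stand-in for a dataset with gaps (values are invented)
toy = pd.DataFrame({
    "project": ["roof", "boiler", "windows"],
    "data_year": [2023, np.nan, np.nan],
    "cost": [1.5, 2.0, np.nan],
})

# isna() marks each missing cell as True; sum() counts the Trues in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same pattern should work on `projects_under_const` if you want the counts for the real data.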

### Class size dataset
Also download the `2021_-_2022_Average_Class_Size_by_School.csv` [2021 - 2022 Average Class Size by School](https://data.cityofnewyork.us/Education/2021-2022-Average-Class-Size-by-School/sgr7-hhwp) dataset, along with its attachments. (Here, only `2021-2022 Average Class Size By School DD.xlsx` is the data dictionary; the other attachment is the dataset as an Excel spreadsheet.) 


In [None]:
class_size = pd.read_csv('2021_-_2022_Average_Class_Size_by_School.csv')

In [None]:
class_size.head()

Here, most of the columns make sense to me. From the data dictionary, I can see that Program Type is coded as follows:

- General Education (Gen Ed)
- Integrated Co-Teaching (ICT)
- Gifted and Talented (G&T)
- Self-Contained (SC)
- Accelerated (Acc)


What does not make sense is the `Minimum Class Size` column, which seems to be the same as the maximum class size column in some cases. Therefore, I'll likely not use this column.

## 1.2 Assessing Data Types
One of the next things we'll check is the data type for each column to make sure that they are in the right format. 

In [None]:
class_size.dtypes

I would not necessarily change the data types for all columns (especially when there are a lot), **just the ones that you might potentially need**. 

Here, `Maximum Class Size` has the `object` dtype (I'm going to ignore `Minimum Class Size` for now), likely because some sizes are entered as a plain integer (`INT`) and others with a leading less-than sign (`<INT`), which pandas can only store as text. 
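
If we did need this column as numbers, one possible approach is to strip the `<` and convert the rest with `pd.to_numeric()`. This is only a sketch on made-up values that imitate the mix described above. Whether treating "<15" as 15 is acceptable depends on your analysis.

```python
import pandas as pd

# Made-up values imitating a column that mixes "INT" and "<INT" entries
raw = pd.Series(["25", "<15", "30", "<15"], dtype="object")

# Remove the literal "<" (regex=False treats it as plain text),
# then convert the remaining digit strings to numbers.
# Caveat: "<15" means "fewer than 15", so 15 is an upper bound, not an exact size.
cleaned = pd.to_numeric(raw.str.replace("<", "", regex=False))

print(cleaned.dtype)
print(cleaned.tolist())
```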



## 1.3 Replacing Data

We went over replacing data last week. There are actually a few ways to do this: 
- `df.replace(to_replace=old_value, value=new_value)`


In [None]:
class_size['Grade Level'].unique()

In [None]:
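## Warning: this overwrites the 'Grade Level' column in class_size!
## We assign the result back because calling .replace(..., inplace=True) on a
## selected column triggers a chained-assignment warning in recent pandas versions.
class_size['Grade Level'] = class_size['Grade Level'].replace('K', '0')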

In [None]:
class_size['Grade Level'].unique()

You can also replace multiple values at once

In [None]:
# We should probably not replace 'K-8 SC' with 0, but showing here for demonstration purposes
class_size['Grade Level'] = class_size['Grade Level'].replace(['K','K-8 SC'], '0')

In [None]:
class_size['Grade Level'].unique()

Note that we are not changing the original CSV file on disk, only the copy of the data held in the `class_size` DataFrame. Re-running `pd.read_csv()` would bring back the original values.

We can also use 
* `df.loc[df['column_name'] == some_value, 'column_name'] = new_value`


In [None]:
class_size.loc[class_size['Grade Level'] == 'K','Grade Level'] = '0'    

For replacing null values, see below. 

## 1.4 Changing data types
Notice that we changed everything in "Grade Level" to numbers, but it's still showing up as an `object`.  Now let's try to change the data type for `Grade Level`. 

`.astype()` converts a column to the data type you pass in. 


In [None]:
class_size.dtypes

In [None]:
## What I've done here is replace the old `Grade Level` column with 
## a version of it that is an int
class_size['Grade Level'] = class_size['Grade Level'].astype(int)

In [None]:
# Notice that `int` from above usually maps to 64-bit integers (int64), though this can depend on your platform. 
class_size['Grade Level'].dtype
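
`.astype(int)` only succeeds if every value in the column can be read as an integer. The sketch below, on made-up values, shows what happens when a label such as 'K-8 SC' is still present, and one possible fallback, `pd.to_numeric()` with `errors="coerce"`, which turns unparseable entries into `NaN` instead of raising an error.

```python
import pandas as pd

# Toy values: two numeric strings and one label that is not a number
grades = pd.Series(["0", "1", "K-8 SC"], dtype="object")

try:
    grades.astype(int)
except ValueError as err:
    print("astype(int) failed:", err)

# errors="coerce" replaces anything that cannot be parsed with NaN
as_numbers = pd.to_numeric(grades, errors="coerce")
print(as_numbers)
```

Because `NaN` is a floating-point value, the coerced column ends up as floats rather than integers.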

## 1.5 Null values in pandas

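There are two main ways to represent the absence of a value in a cell in pandas: 
- `None` is Python's built-in object for "no value". It is not a numeric type, so it typically shows up in `object` columns. 
- `NaN` ("not a number") is a special floating-point value that pandas uses to represent missing data in numeric columns.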
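
A minimal sketch of the difference, using two made-up Series: pandas converts `None` to `NaN` when the other values are numbers, keeps it as `None` in an `object` column, and `isna()` detects both.

```python
import pandas as pd

# None among numbers is converted to NaN, which forces the column to float
numbers = pd.Series([1, None, 3])

# In an object column, None is stored as-is
words = pd.Series(["a", None, "c"], dtype="object")

print(numbers.dtype)
print(numbers.isna().tolist())
print(words.isna().tolist())
```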


In [None]:
projects_under_const.head()

## 1.6 Handling missing data
Now, let's say that our analysis depends on knowing the year the data was created. There are a few ways of handling missing data. 

### 1.6.1 Removing rows 
We can remove those rows with data missing from a column that we are planning to use in our analysis. 

In [None]:
projects_under_const[projects_under_const['data_year'].isna()]

In [None]:
# isna() returns a boolean (True or False) for each row: True where data_year is a NaN.
# notna() is its opposite: True where data_year has a value.
# We are going to use that boolean to filter the dataframe 
# and keep only the rows where the data_year column is not a NaN.

projects_under_const_new = projects_under_const[projects_under_const['data_year'].notna()]
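
pandas also has a built-in shortcut for this kind of filter: `dropna()` with the `subset` argument. The sketch below uses a made-up toy DataFrame (only the `data_year` column name is borrowed from the dataset) to check that it keeps the same rows as the boolean filter above.

```python
import pandas as pd
import numpy as np

# Toy data with one missing year (values are invented)
toy = pd.DataFrame({
    "project": ["roof", "boiler", "windows"],
    "data_year": [2023, np.nan, 2022],
})

# Option 1: boolean mask, keeping rows where data_year is present
kept_by_mask = toy[toy["data_year"].notna()]

# Option 2: dropna() restricted to the column we care about
kept_by_dropna = toy.dropna(subset=["data_year"])

print(kept_by_mask.equals(kept_by_dropna))
```

Without `subset`, `dropna()` removes every row that has a missing value in any column, which may discard more data than intended.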

### 1.6.2 Replacing missing data
We can also replace the missing data with certain values: 
- We can replace the data with the mean of the non-NaN column values, for numerical values. (For instance, if our columns were something like "adult heights", then replacing the NaN with the mean values in the columns would allow us to leave the sample mean unchanged, which might be good for regression purposes). 
- We can also replace with the median (if you think there are outliers in the sample that might be skewing the mean)
- Replacing with the mode (most frequent value) would make more sense if we think that there's some default value 
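
To make the three options concrete, here is a small sketch on made-up "adult heights" values. The numbers are invented for illustration only.

```python
import pandas as pd
import numpy as np

# Four known heights and one missing value (made-up numbers)
heights = pd.Series([150.0, 160.0, 160.0, 190.0, np.nan])

# mean(), median() and mode() all ignore NaN by default
filled_mean = heights.fillna(heights.mean())
filled_median = heights.fillna(heights.median())

# mode() returns a Series (there can be ties), so take the first entry
filled_mode = heights.fillna(heights.mode()[0])

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

Filling with the mean leaves the sample mean unchanged, while the median and mode are less affected by the one unusually large value.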

**What would you do here?**

In [None]:
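# This gets the mode of the data_year column.
# mode() returns a Series (there can be ties), so we take the first value with [0];
# passing the whole Series to fillna() would only fill rows whose index matches.
mode_year = projects_under_const['data_year'].mode()[0]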


In [None]:
mode_year

In [None]:
# This fills the NaNs with the mode using the fillna() function
# fillna() is a method that fills in missing values with a value of your choice
projects_under_const['data_year'].fillna(mode_year)


In [None]:
# Now write over the old data_year column with the new one
projects_under_const['data_year'] = projects_under_const['data_year'].fillna(mode_year)

# In-Class Exercise 
Using the `toy_transit.csv` dataset in this repo, identify and address the missing data issues. 

In [None]:
## insert your code here

# Miniconda (optional)
Some of you may have noticed that Anaconda takes up 3GB. If this is an issue on your computer, and if you have time right now: 

1) Follow [these instructions](https://docs.conda.io/projects/miniconda/en/latest/miniconda-install.html) to download Miniconda, which is a more lightweight Python environment. I think it's about 400 MB. 
2) Once you have installed Miniconda, type `conda list` in your terminal. 
If you get a list of installed packages, you've got conda installed. 
3) Now use the `gds_py_smaller.yml` file (make sure it's in your current working directory!) and type

 `conda env create -f gds_py_smaller.yml`