In [8]:
# We are going to start importing the libraries we need
# all in one cell. 
# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.

import pandas as pd


# Students and capital improvements
We are going to continue with the datasets that we worked on earlier this week. Again, our objective is to look at the relationship between the **total number of students in a general ed public school** and the **money spent on new school construction and improvements in that school**.

# 0. Read in data

In [9]:
projects_under_const = pd.read_csv('Active_Projects_Under_Construction.csv')
# Let's pretend we don't have the 'data_year' column, which wasn't in the original dataset anyways
projects_under_const = projects_under_const.drop(columns='data_year')

class_size = pd.read_csv('2021_-_2022_Average_Class_Size_by_School.csv')

## 1.1 Slicing Strings

### 1.1.1 Example 1


The `projects_under_const` DataFrame has a `Data As Of` column, which gives us some temporal variation: at the least, it tells us when the data was added to the table. It could be useful, for instance, if we think that `Data As Of` is a rough proxy for when the project was funded or approved.

In [12]:
projects_under_const.head()

Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,
1,BAYSIDE HIGH SCHOOL - QUEENS,Q,26,FY19 RESO A AUDITORIUM UPGRADE,1261000.0,CIP,Q405,32-24 CORPORAL KENNEDY STREET,Queens,10301.0,...,,,,,,,,,,1/6/22
2,P.S. @ PARCEL F - QUEENS,Q,30,Demo,0.0,CAP,,2ND STREET BETWEEN 56TH AND 57TH AVENUE,Queens,11101.0,...,,,,,,,,,,10/30/18
3,3K CENTER @ 3893 DYRE AVENUE - BRONX,X,11,Lease,6262000.0,CAP,X501,3893 DYRE AVEUNE,Bronx,,...,,,,,,,,,,8/4/22
4,P.S. 129 - QUEENS,Q,25,Addition,0.0,CAP,Q129,128-02 7TH AVENUE,Queens,11356.0,...,40.790638,-73.839771,7.0,19.0,945.0,4096774.0,4039760000.0,Whitestone,"(40.790638, -73.839771)",2/6/19


In [13]:
# Remember that NaN means "Not a Number".
# In other words, it is a missing value
projects_under_const['Data As Of'].head()

0         NaN
1      1/6/22
2    10/30/18
3      8/4/22
4      2/6/19
Name: Data As Of, dtype: object

Let's say we want to extract year from these dates. We have another string-related function we can apply to all of our values under `Data As Of`. 

`.split()` splits strings around a given separator/delimiter to create a list of strings.

Here, we will use `/` as our separator. 

In [14]:
# ".str" is an accessor that lets us apply a suite of string methods/functions to every value in a column
projects_under_const['Data As Of'].str.split('/')

0                NaN
1         [1, 6, 22]
2       [10, 30, 18]
3         [8, 4, 22]
4         [2, 6, 19]
            ...     
8996     [11, 2, 22]
8997     [11, 2, 22]
8998     [11, 2, 22]
8999     [11, 2, 22]
9000     [11, 2, 22]
Name: Data As Of, Length: 9001, dtype: object

Now we just have to get the last value (where it exists) and create a new column with the year. 

Here, `[-1]` gets the last element of a list, and `.str` applies it to each element of the column.

In [15]:
projects_under_const['Data As Of'].str.split('/').str[-1]

0       NaN
1        22
2        18
3        22
4        19
       ... 
8996     22
8997     22
8998     22
8999     22
9000     22
Name: Data As Of, Length: 9001, dtype: object
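
As a self-contained sketch of the two steps above, the toy Series below (made-up dates, not rows from the dataset) is split on `/` and the last piece of each list is kept. The names `dates`, `pieces`, and `years` are just illustrative.

```python
import pandas as pd

# Toy dates in the same month/day/two-digit-year layout (illustrative values only)
dates = pd.Series(['1/6/22', '10/30/18', None])

# Split each string on '/' to get a list of pieces per row
pieces = dates.str.split('/')

# .str[-1] picks the last element of each list; missing rows stay missing
years = pieces.str[-1]
print(years.tolist())
```

Note that the extracted years are still strings, so they would need converting before doing any arithmetic with them.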

Now, let's create a new column called `data_year` with our newly extracted year values. 

In [16]:

projects_under_const['data_year'] = projects_under_const['Data As Of'].str.split('/').str[-1]

In [17]:
# Notice that when there was an NaN, the split function returned a NaN
projects_under_const.head()

Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,
1,BAYSIDE HIGH SCHOOL - QUEENS,Q,26,FY19 RESO A AUDITORIUM UPGRADE,1261000.0,CIP,Q405,32-24 CORPORAL KENNEDY STREET,Queens,10301.0,...,,,,,,,,,1/6/22,22.0
2,P.S. @ PARCEL F - QUEENS,Q,30,Demo,0.0,CAP,,2ND STREET BETWEEN 56TH AND 57TH AVENUE,Queens,11101.0,...,,,,,,,,,10/30/18,18.0
3,3K CENTER @ 3893 DYRE AVENUE - BRONX,X,11,Lease,6262000.0,CAP,X501,3893 DYRE AVEUNE,Bronx,,...,,,,,,,,,8/4/22,22.0
4,P.S. 129 - QUEENS,Q,25,Addition,0.0,CAP,Q129,128-02 7TH AVENUE,Queens,11356.0,...,-73.839771,7.0,19.0,945.0,4096774.0,4039760000.0,Whitestone,"(40.790638, -73.839771)",2/6/19,19.0


### 1.1.2 Example 2
We will eventually be comparing school attendance characteristics to money allocated through **merging along a common column name** at the **school level**.

What are our options for merging here? Let's take a look.

In [18]:
class_size.head()

Unnamed: 0,DBN,School Name,Grade Level,Program Type,Number of Students,Number of Classes,Average Class Size,Minimum Class Size,Maximum Class Size
0,01M015,PS 015 ROBERTO CLEMENTE,K,G&T,13,1,13.0,<15,<15
1,01M015,PS 015 ROBERTO CLEMENTE,K,ICT,17,1,17.0,17,17
2,01M015,PS 015 ROBERTO CLEMENTE,1,G&T,8,1,8.0,<15,<15
3,01M015,PS 015 ROBERTO CLEMENTE,1,ICT,18,1,18.0,18,18
4,01M015,PS 015 ROBERTO CLEMENTE,2,G&T,8,1,8.0,<15,<15


In [19]:
projects_under_const.head()

Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,
1,BAYSIDE HIGH SCHOOL - QUEENS,Q,26,FY19 RESO A AUDITORIUM UPGRADE,1261000.0,CIP,Q405,32-24 CORPORAL KENNEDY STREET,Queens,10301.0,...,,,,,,,,,1/6/22,22.0
2,P.S. @ PARCEL F - QUEENS,Q,30,Demo,0.0,CAP,,2ND STREET BETWEEN 56TH AND 57TH AVENUE,Queens,11101.0,...,,,,,,,,,10/30/18,18.0
3,3K CENTER @ 3893 DYRE AVENUE - BRONX,X,11,Lease,6262000.0,CAP,X501,3893 DYRE AVEUNE,Bronx,,...,,,,,,,,,8/4/22,22.0
4,P.S. 129 - QUEENS,Q,25,Addition,0.0,CAP,Q129,128-02 7TH AVENUE,Queens,11356.0,...,-73.839771,7.0,19.0,945.0,4096774.0,4039760000.0,Whitestone,"(40.790638, -73.839771)",2/6/19,19.0


Even though there is a **School Name** column in both datasets, the format seems to be quite different. 
- For the `projects_under_const` dataset, the school names are all over the place. Some are the name and borough separated by a `-`, some also include an `@` followed by a rough location.
- For the `class_size` df, the school names are consistent, but we can see that it might be a pain to match the two. 

In [20]:
projects_under_const['School Name']

0                                        NaN
1               BAYSIDE HIGH SCHOOL - QUEENS
2                   P.S. @ PARCEL F - QUEENS
3       3K CENTER @ 3893 DYRE AVENUE - BRONX
4                          P.S. 129 - QUEENS
                        ...                 
8996                     P.S. 236 - BROOKLYN
8997                        P.S. 277 - BRONX
8998                       P.S. 5 - BROOKLYN
8999                        P.S. 182 - BRONX
9000                        I.S. 127 - BRONX
Name: School Name, Length: 9001, dtype: object

In [21]:
class_size['School Name']

0                              PS 015 ROBERTO CLEMENTE
1                              PS 015 ROBERTO CLEMENTE
2                              PS 015 ROBERTO CLEMENTE
3                              PS 015 ROBERTO CLEMENTE
4                              PS 015 ROBERTO CLEMENTE
                             ...                      
12440                  PS 377 ALEJANDRINA B DE GAUTIER
12441                        JHS 383 PHILIPPA SCHUYLER
12442                        JHS 383 PHILIPPA SCHUYLER
12443                      PS /IS 384 FRANCES E CARTER
12444    EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION
Name: School Name, Length: 12445, dtype: object

Instead, I noticed that there's a `Building ID` column in the `projects_under_const` DF (dataframe, for short) that, though described unhelpfully as "ID of the Building" in the documentation, looks similar to the `DBN` from the `class_size` DF.


In fact, when I look at what `DBN` is in the class size documentation, it says that this column "Denotes cocatenation[sic] of district, borough and three digit school number."

I'm going to guess here that if I extract the "borough and three digit school number" part of `DBN`, this will match my `Building ID` column. 

Thankfully, it seems like there is a fixed number of characters I need to extract from `DBN`:
- Borough = 1
- School number = 3

In total, I will need the last 4 characters from `DBN`. We'll do this with a string slice.

In [22]:
# Here I am going to use the .str accessor to get the last 4 characters of the DBN.
# Within the square brackets, -4: means start at the fourth character from the end
# and keep everything through the end of the string.

class_size['DBN'].str[-4:]

0        M015
1        M015
2        M015
3        M015
4        M015
         ... 
12440    K377
12441    K383
12442    K383
12443    K384
12444    K562
Name: DBN, Length: 12445, dtype: object

Quick review of selecting ranges:

In [23]:
# It's a little strange because backwards counting starts at -1
class_size['DBN'].str[-1:]

0        5
1        5
2        5
3        5
4        5
        ..
12440    7
12441    3
12442    3
12443    4
12444    2
Name: DBN, Length: 12445, dtype: object

In [24]:
## Here, 4: means that I want to start at the fifth character 
## because python starts counting at 0 for forward counting
class_size['DBN'].str[4:]


0        15
1        15
2        15
3        15
4        15
         ..
12440    77
12441    83
12442    83
12443    84
12444    62
Name: DBN, Length: 12445, dtype: object

In [25]:
## And if I wanted to select a slice of the string in the middle
## I can do the following
class_size['DBN'].str[1:4]

0        1M0
1        1M0
2        1M0
3        1M0
4        1M0
        ... 
12440    2K3
12441    2K3
12442    2K3
12443    2K3
12444    2K5
Name: DBN, Length: 12445, dtype: object
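
The `.str[...]` slices above follow the same rules as slicing a plain Python string. As a small sketch, here are the four slices applied to a single DBN value taken from the first row of the output; the variable names are just for illustration.

```python
dbn = '01M015'  # district '01', borough 'M', school number '015'

last_four = dbn[-4:]   # from the fourth-from-last character to the end
last_one = dbn[-1:]    # just the final character
from_fifth = dbn[4:]   # index 4 is the fifth character, through the end
middle = dbn[1:4]      # indexes 1, 2, and 3; the stop index 4 is excluded

print(last_four, last_one, from_fifth, middle)
```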

Back to our exercise, let's assign our slice to a new column called `bid`.

In [26]:
class_size['bid'] = class_size['DBN'].str[-4:]

In [None]:
class_size.head()
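
Since the link between `bid` and `Building ID` is a guess, it may be worth checking how many values actually line up before merging. The sketch below uses two made-up toy tables (`class_size_toy` and `projects_toy` are hypothetical stand-ins, not the real data) and `.isin()` to measure the overlap.

```python
import pandas as pd

# Made-up stand-ins for the two tables (illustrative values only)
class_size_toy = pd.DataFrame({'DBN': ['01M015', '01M019', '13K003']})
projects_toy = pd.DataFrame({'Building ID': ['M015', 'K003', 'X843']})

class_size_toy['bid'] = class_size_toy['DBN'].str[-4:]

# True where a bid also appears as a Building ID
has_match = class_size_toy['bid'].isin(projects_toy['Building ID'])

# The mean of a boolean column is the share of True values
match_share = has_match.mean()
print(match_share)
```

On the real data, a share close to 1 would support the guess, while a low share would suggest the two ID columns are not directly comparable.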

## 1.3 Aggregating data: Split-apply-combine
The split-apply-combine operation is very common in pandas. We often want to aggregate data by some category. For example, we might want to know the total amount of construction money allocated by school. Or we might want to know the total number of students in each school.

For the projects under construction, let's group by the `Building ID`, which is our identifier for school here, and sum all the award amounts by school to get the:
- Total construction award amount per school
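
As a minimal sketch of the idea, here is split-apply-combine on a made-up three-row table (`projects_toy` and its values are hypothetical). Rows are split into groups by `Building ID`, the awards in each group are summed, and the results are combined into one value per school.

```python
import pandas as pd

# Made-up projects: two for building K001 and one for X843
projects_toy = pd.DataFrame({
    'Building ID': ['K001', 'K001', 'X843'],
    'Construction Award': [100.0, 250.0, 50.0],
})

# Split: one group per Building ID
# Apply: sum the awards within each group
# Combine: one value per Building ID in the result
award_totals = projects_toy.groupby('Building ID')['Construction Award'].sum()
print(award_totals)
```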

In [31]:
projects_under_const.groupby('Building ID').count()

Unnamed: 0_level_0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building Address,City,Postcode,Borough,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
Building ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
K001,11,11,11,11,11,11,11,11,10,11,...,10,10,10,10,10,10,10,10,11,11
K002,9,9,9,9,9,9,9,9,8,9,...,8,8,8,8,8,8,8,8,9,9
K003,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,3,3
K005,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,4
K007,9,9,9,9,9,9,9,9,7,9,...,7,7,7,7,7,7,7,7,9,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X843,17,17,17,17,17,17,17,17,15,17,...,15,15,15,15,15,15,15,15,16,16
X862,3,3,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
X930,5,5,5,5,5,5,5,5,4,5,...,4,4,4,4,4,4,4,4,5,5
X970,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [None]:
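## In pandas 2.0 and later, .sum() no longer skips non-numeric columns by default,
## so pass numeric_only=True to sum only the numeric columns
projects_under_const.groupby('Building ID').sum(numeric_only=True)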

Most of these columns are gibberish after we sum (for ex: we don't need a sum of latitudes and longitudes by school). Let's just select the columns we want to use: 

In [None]:
# Remember the brackets after a DF allow you to select columns
projects_under_const.groupby('Building ID').sum(numeric_only=True)['Construction Award']

Let's assign this to a new variable name. 

In [None]:
projects_under_const_agg = projects_under_const.groupby('Building ID').sum(numeric_only=True)['Construction Award']

Here you can see that the result is a **pandas Series**. To make this easier to work with during the merge, let's transform this into a pandas DF. 

I'm going to use a method called `.reset_index()` as a trick to do this. `.reset_index()` moves the current index (here, `Building ID`) into a regular column and replaces it with the default index of sequential numbers. Calling it on a Series returns a DataFrame.

In [None]:
# See how Building ID, which was the index before, is now a column,
# and the index is just 0,...,1180

projects_under_const_agg.reset_index()

In [None]:
projects_under_const_agg = projects_under_const_agg.reset_index()

In [None]:
projects_under_const_agg
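
To see the `.reset_index()` trick in isolation, here is a small sketch on a hand-built Series (the name `totals` and its values are made up). The index labels become a regular column, and the result is a DataFrame with a fresh 0, 1, ... index.

```python
import pandas as pd

# A Series indexed by Building ID, like the output of a groupby sum
totals = pd.Series([350.0, 50.0],
                   index=pd.Index(['K001', 'X843'], name='Building ID'),
                   name='Construction Award')

# reset_index() moves the index into a column and returns a DataFrame
totals_df = totals.reset_index()
print(totals_df.columns.tolist())
print(totals_df.index.tolist())
```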

Let's do something similar with the `class_size` df. As we can see below, our data is likely one row per grade and program. We want to aggregate this to the school level.

In [None]:
class_size.head()

I'm first going to filter my DF since I just want 'Gen Ed' in order not to skew the representative class size by special programs. 

In [None]:
# .unique() returns an array of all the unique values in a column
class_size['Program Type'].unique()

In [None]:
# I am going to use the == operator to check if the value in the Program Type column is equal to 'Gen Ed'
# Then we'll set this filtered dataframe to a new variable
# and use that new dataframe from now on. 
class_size_new = class_size[class_size['Program Type']=='Gen Ed']

In [None]:
class_size_new.head()
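
The filter above works in two stages, which the sketch below separates on a made-up table (`class_size_toy` is hypothetical). The comparison builds a True/False mask, and the mask then selects rows. The `.copy()` is an optional addition, not part of the original cell; it makes the filtered result an independent DataFrame, which avoids warnings if columns are added to it later.

```python
import pandas as pd

# Made-up rows with a mix of program types
class_size_toy = pd.DataFrame({
    'Program Type': ['Gen Ed', 'ICT', 'Gen Ed', 'G&T'],
    'Number of Students': [25, 17, 30, 13],
})

# Stage 1: a boolean mask, True where the program type is 'Gen Ed'
is_gen_ed = class_size_toy['Program Type'] == 'Gen Ed'

# Stage 2: keep only the rows where the mask is True
gen_ed_only = class_size_toy[is_gen_ed].copy()
print(gen_ed_only)
```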

Now let's groupby `bid` and sum all the grades within each school. 

In [None]:
class_size_new.groupby('bid').sum(numeric_only=True)

Again, we'll just need the `Number of Students` column here. And I'm going to do the `reset_index()` trick again. This time, I'm going to string all these steps together

In [None]:
# Python evaluates this chain from left to right and applies each method to the result of everything on its left
# So, first we are going to group by bid
# Then we are going to sum the numeric columns in each group
# Then from the entire summed dataframe, we are going to select the Number of Students column
# Selecting that series, we are going to reset the index to create our new dataframe.

class_size_new_agg = class_size_new.groupby('bid').sum(numeric_only=True)['Number of Students'].reset_index()

In [None]:
class_size_new_agg.head()

## 1.4 Merging dataframes
Lastly, we want to do the merge part using the `.merge()` function.

It follows this format: 
```
df1.merge(df2,left_on='df1 col for merging', right_on='df2 col for merge')
```

By default, the type of merge is **inner**. The full set of options for the `how` argument is:
- ‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’
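
To get a feel for how the merge types differ, the sketch below merges two made-up tables (`students` and `awards` are hypothetical, with only partly overlapping IDs) and counts the rows each type keeps.

```python
import pandas as pd

# K002 has no award; X843 has an award but no student counts
students = pd.DataFrame({'bid': ['K001', 'K002', 'M015'],
                         'Number of Students': [400, 350, 120]})
awards = pd.DataFrame({'Building ID': ['K001', 'M015', 'X843'],
                       'Construction Award': [350.0, 0.0, 50.0]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = students.merge(awards, left_on='bid', right_on='Building ID', how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

An inner merge keeps only IDs found in both tables, left and right keep every row from one side, and outer keeps everything. Rows without a partner get `NaN` in the columns that come from the other table.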

We are going to merge:
- `projects_under_const_agg`
- `class_size_new_agg`

In [None]:
## Here, I'm doing an inner merge, which means that I am only going to keep the rows that have a match in both dataframes
## This means only the schools that received a construction award are going to be kept
class_size_new_agg.merge(projects_under_const_agg,
                         left_on='bid',
                         right_on='Building ID')

In [None]:
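## With how='left', every school in class_size_new_agg is kept;
## schools with no construction award get NaN for Construction Award
merged_df = class_size_new_agg.merge(projects_under_const_agg,
                         left_on='bid',
                         right_on='Building ID', 
                         how='left')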

Ok, finally, to get to our answer, we're going to apply the `.corr()` function to our dataframe. The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) tells us that this function computes pairwise correlation of columns, excluding NA/null values.

The default method is a 'Pearson' correlation, with all methods being: 
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
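
As a rough illustration of how the choice of method can matter, the sketch below compares Pearson and Spearman on made-up data where `y` always rises with `x` but not in a straight line. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# y = x squared: perfectly monotonic, but not linear
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': [1, 4, 9, 16, 25]})

# .corr() returns a matrix; .loc picks out the x-versus-y entry
pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print(round(pearson, 3), round(spearman, 3))
```

Pearson measures how close the points are to a straight line, so it comes out a little below 1 here. Spearman works on ranks, so any relationship that always increases scores exactly 1.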

In [None]:
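# Yikes, 0.154657 correlation. I guess I assumed wrong that there would be a strong correlation between the number of students in a school and the amount of money spent on construction.
# numeric_only=True leaves out the text columns (bid and Building ID); pandas 2.0+ raises an error on them otherwise
merged_df.corr(numeric_only=True)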

Well, that wasn't the strong relationship I expected. This doesn't necessarily mean there's no relationship, but I'm going to stop my investigation here for now. 

### 1.4.1 Merging with pd.concat

In [36]:
data_q1 = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Q1_Sales': [250, 150, 200, 300]
}
df_q1 = pd.DataFrame(data_q1)

data_q2 = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Q2_Sales': [260, 110, 210, 310]
}
df_q2 = pd.DataFrame(data_q2)

data_q3 = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Q3_Sales': [270, 120, 220, 320]
}
df_q3 = pd.DataFrame(data_q3)


In [39]:
df_q1

Unnamed: 0,Product_ID,Q1_Sales
0,P001,250
1,P002,150
2,P003,200
3,P004,300


In [40]:
df_q2

Unnamed: 0,Product_ID,Q2_Sales
0,P001,260
1,P002,110
2,P003,210
3,P004,310


In [41]:
df_q3

Unnamed: 0,Product_ID,Q3_Sales
0,P001,270
1,P002,120
2,P003,220
3,P004,320


In [42]:
pd.concat([df_q1,df_q2,df_q3])          

Unnamed: 0,Product_ID,Q1_Sales,Q2_Sales,Q3_Sales
0,P001,250.0,,
1,P002,150.0,,
2,P003,200.0,,
3,P004,300.0,,
0,P001,,260.0,
1,P002,,110.0,
2,P003,,210.0,
3,P004,,310.0,
0,P001,,,270.0
1,P002,,,120.0


In [45]:
pd.concat([df_q1,df_q2,df_q3],axis=0)          

Unnamed: 0,Product_ID,Q1_Sales,Q2_Sales,Q3_Sales
0,P001,250.0,,
1,P002,150.0,,
2,P003,200.0,,
3,P004,300.0,,
0,P001,,260.0,
1,P002,,110.0,
2,P003,,210.0,
3,P004,,310.0,
0,P001,,,270.0
1,P002,,,120.0


In [44]:
pd.concat([df_q1,df_q2,df_q3],axis=1)   

Unnamed: 0,Product_ID,Q1_Sales,Product_ID.1,Q2_Sales,Product_ID.2,Q3_Sales
0,P001,250,P001,260,P001,270
1,P002,150,P002,110,P002,120
2,P003,200,P003,210,P003,220
3,P004,300,P004,310,P004,320
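
The outputs above show two side effects of `pd.concat`: stacking rows repeats the index labels (0, 1, 2, 3, 0, 1, ...), and placing frames side by side repeats the `Product_ID` column. One possible way around both is sketched below on smaller made-up frames (`q1_toy` and `q2_toy` are hypothetical). It uses `ignore_index=True` for stacking and moves `Product_ID` into the index before concatenating along `axis=1`.

```python
import pandas as pd

q1_toy = pd.DataFrame({'Product_ID': ['P001', 'P002'], 'Q1_Sales': [250, 150]})
q2_toy = pd.DataFrame({'Product_ID': ['P001', 'P002'], 'Q2_Sales': [260, 110]})

# Stacking rows: ignore_index=True renumbers the result 0..n-1
stacked = pd.concat([q1_toy, q2_toy], ignore_index=True)

# Side by side: with Product_ID as the index, rows align on it
# and the column is not repeated
frames = [df.set_index('Product_ID') for df in (q1_toy, q2_toy)]
side_by_side = pd.concat(frames, axis=1).reset_index()

print(stacked.index.tolist())
print(side_by_side.columns.tolist())
```

When the goal is to line up rows by a shared key, `.merge()` from section 1.4 is another option.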


## 1.5 Parsing dates
There is a datetime data type in Pandas that allows us to turn columns with dates and date-times into a `datetime` type. To convert a column, use the function 

`pd.to_datetime(df['datetime column'])`

Let's try that: 

In [47]:
projects_under_const['Data As Of']

0            NaN
1         1/6/22
2       10/30/18
3         8/4/22
4         2/6/19
          ...   
8996     11/2/22
8997     11/2/22
8998     11/2/22
8999     11/2/22
9000     11/2/22
Name: Data As Of, Length: 9001, dtype: object

In [48]:
pd.to_datetime(projects_under_const['Data As Of'])

0             NaT
1      2022-01-06
2      2018-10-30
3      2022-08-04
4      2019-02-06
          ...    
8996   2022-11-02
8997   2022-11-02
8998   2022-11-02
8999   2022-11-02
9000   2022-11-02
Name: Data As Of, Length: 9001, dtype: datetime64[ns]

Pretty easy! In the background, Pandas is inferring what your date-time format is. You can also state this more explicitly: 

In [49]:
## Lower-case "y" means the year is represented by the last two digits
## Upper-case "Y" means the year is represented by the entire year
## So, if we have 2021, we should use "Y"
## If we have 21, we should use "y"

pd.to_datetime(projects_under_const['Data As Of'], format='%m/%d/%y')

0             NaT
1      2022-01-06
2      2018-10-30
3      2022-08-04
4      2019-02-06
          ...    
8996   2022-11-02
8997   2022-11-02
8998   2022-11-02
8999   2022-11-02
9000   2022-11-02
Name: Data As Of, Length: 9001, dtype: datetime64[ns]
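
One thing to keep in mind with an explicit `format` is that a value which does not fit the pattern raises an error by default. The sketch below, on made-up values, shows the `errors='coerce'` option, which turns anything unparseable into `NaT` instead. Whether the real column contains such values is not something the output above tells us.

```python
import pandas as pd

# Two valid dates, one malformed string, and one missing value (made up)
raw = pd.Series(['1/6/22', '10/30/18', 'unknown', None])

# errors='coerce' replaces values that do not match the format with NaT
parsed = pd.to_datetime(raw, format='%m/%d/%y', errors='coerce')
print(parsed)
```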

Now we can create a new column from this.

In [50]:
projects_under_const['data_date_new'] = pd.to_datetime(projects_under_const['Data As Of'], format='%m/%d/%y')

Now we can access date characteristics. 

In [52]:
projects_under_const['data_date_new'].dt.year

0          NaN
1       2022.0
2       2018.0
3       2022.0
4       2019.0
         ...  
8996    2022.0
8997    2022.0
8998    2022.0
8999    2022.0
9000    2022.0
Name: data_date_new, Length: 9001, dtype: float64

In [53]:
projects_under_const['data_date_new'].dt.month

0        NaN
1        1.0
2       10.0
3        8.0
4        2.0
        ... 
8996    11.0
8997    11.0
8998    11.0
8999    11.0
9000    11.0
Name: data_date_new, Length: 9001, dtype: float64
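
The years and months above come back as floats (`2022.0`) because the column contains missing dates. If whole numbers are preferred, one option is pandas' nullable integer type `'Int64'`, sketched here on a made-up two-row Series.

```python
import pandas as pd

# One valid date and one missing value (made up)
dates = pd.to_datetime(pd.Series(['1/6/22', None]), format='%m/%d/%y')

years_float = dates.dt.year              # float64 because of the missing row
years_int = years_float.astype('Int64')  # nullable integers: 2022 and <NA>
print(years_int)
```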

# In-Class Exercise 
Using the `FDNY_Firehouse_Listing.csv` dataset, show which neighborhoods (NTAs) have the most firehouses.

In [None]:
## INSERT YOUR CODE HERE