# PANDAS CAPSTONE PROJECT - SCHOOL SAFETY

![scs.png](scs.png)
    
<img src="https://www.vernonpublicschools.org/uploaded/Safety_and_Security/crssafety.jpg" width="310">

* Since 1998, the New York City Police Department (NYPD) has been tasked with the collection and maintenance of crime data for incidents that occur in New York City public schools. For presentation purposes, each incident has been classified in one of three categories. These categories are:
<br>
**Major Crimes:** This category is consistent with those regularly and publicly reported by the NYPD. It includes the most serious personal and property crimes. The property crimes are burglary, grand larceny and grand larceny auto. The crimes against persons are murder, rape, robbery and felony assault.
<br>
**Other Crimes:** This category is composed of many crimes and incidents that range in severity. It includes reports of incidents such as arson/explosion, misdemeanor assault, criminal possession or sale of a controlled substance, sale of marijuana, criminal mischief, petit larceny, reckless endangerment, sex offenses (not including rape, which is included in the Major Crimes), and weapons possession.
<br>
**Non-Criminal Incidents:** This category includes actions which are not classified as crimes but are nevertheless disruptive to the school environment. It includes disorderly conduct, harassment, loitering, possession of marijuana, dangerous instruments and trespass.
<br>
The NYPD and the NYC Department of Education store this crime data as annual school safety reports, which are published on https://www.data.gov/ . <br>
 __In this data analysis exercise, I concatenate the School Safety Reports for 2015-16 and 2016-17 and analyse the combined data.__ <br>


### IMPORTING LIBRARIES
* Import `numpy` as `np` and `pandas` as `pd`

In [4]:
import numpy as np 
import pandas as pd

### STEP 1: EXAMINING DF 

#### IMPORTING DATA

* Import the school safety data and name it as:
    * 2015_16ss: `ss1516`
    * 2016_17ss: `ss1617`
* Don't forget to set `encoding="utf-8"`, `quotechar='"'`, and `delimiter=","`

* Create `ss1517` by concatenating `ss1516` and `ss1617`
* Use `shape` to figure out how many rows and columns our `ss1517` has.

* Print the first 3 rows of `ss1517`
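
The steps above could look roughly like the sketch below. The real CSV file names are not given here, so two tiny in-memory CSVs with made-up values stand in for them; `ignore_index=True` is a choice of this sketch, not something the exercise requires.

```python
import io
import pandas as pd

# Tiny made-up stand-ins for the two yearly CSV files (hypothetical values).
csv_1516 = io.StringIO(
    'Location Name,Borough,Register,Major N\n'
    '"School A",K,500,0\n'
    '"School B",M,750,1\n'
)
csv_1617 = io.StringIO(
    'Location Name,Borough,Register,Major N\n'
    '"School A",K,510,1\n'
    '"School C",X,300,0\n'
)

# With the real data, pass the file path instead of the StringIO object.
ss1516 = pd.read_csv(csv_1516, encoding="utf-8", quotechar='"', delimiter=",")
ss1617 = pd.read_csv(csv_1617, encoding="utf-8", quotechar='"', delimiter=",")

# Stack the two years on top of each other and renumber the rows.
ss1517 = pd.concat([ss1516, ss1617], ignore_index=True)

print(ss1517.shape)    # (rows, columns)
print(ss1517.head(3))  # first 3 rows
```

Without `ignore_index=True` the row labels of both files are kept, so the combined index would contain duplicates.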

#### Explanation of the Columns is needed to understand our analysis better <br>
* __Location Name__ is the name by which the organization is known. For a learning community, it is the official title of the school. <br>
* __Location Code__ is a unique identifier that can include schools, administrative offices, learning communities, etc. <br>
* __Borough__ is the NYC borough the location is situated in. <br>
* __Geographical District Code__ is the school’s geographical district as defined by the NYC Department of Education. <br>
* __Register__ is the number of students on register. <br>
* __Building Name__ is the official name of the building a school is located in. <br>
* __# Schools__ is the number of schools in the building. <br>
* __Schools in the Building__ is the names of the schools in the buildings. <br>
* __Major N__ is the number of major crimes. <br>
* __Oth N__ is the number of other crimes. <br>
* __NoCrim N__ is the number of non-criminal incidents. <br>
* __Prop N__ is the number of property crimes. <br>
* __Vio N__ is the number of violent crimes. <br>
* __EnGroup A__ is the building population. <br>
* __Range A__ is the group name the building population falls under. <br>
* __AvgofMajorN__ is the average of major crimes for all buildings that have the same EnGroupA/Range A. <br>
* __AvgofOthN__ is the average of other crimes for all buildings that have the same EnGroupA/Range A. <br>
* __AvgofNoCrimN__ is the average of non-criminal incidents for all buildings that have the same EnGroupA/Range A. <br>
* __AvgofPropN__ is the average of property crimes for all buildings that have the same EnGroupA/Range A. <br>
* __AvgofVioN__ is the average of violent crimes for all buildings that have the same EnGroupA/Range A. <br>
---
Let's take a brief look at our data.

* Use `.info()` to get summary information about the `ss1517`

8 of our columns have dtype object, while the other 12 have dtype float.
___

Let's examine the summary statistics of our ss1517 df:

* Use `.describe()` to receive summary statistics about `ss1517`
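
As a rough illustration of what these two calls report, here is a sketch on a small made-up dataframe (the values are hypothetical, not taken from `ss1517`).

```python
import numpy as np
import pandas as pd

# Made-up sample with one object column and two float columns.
df = pd.DataFrame({
    "Borough": ["K", "M", None, "X"],
    "Register": [500.0, np.nan, 300.0, 420.0],
    "Major N": [0.0, 1.0, np.nan, 2.0],
})

# Prints column names, non-null counts and dtypes.
df.info()

# Summary statistics; by default only numeric columns are included.
summary = df.describe()
print(summary)
```

The `count` row of `describe()` excludes missing values, so it is also a quick hint about where Na values are hiding.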

### STEP 2: LOCATING & REMOVING NA VALUES

Let's check whether our df has Na values or not:
* Use `isnull().values.any()` for this purpose

Apparently, we have some Na values. Let's figure out how many Na values we have:

* Use `isnull().values.sum()` to see how many NA values we have in `ss1517`
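
A minimal sketch of both checks on a made-up dataframe (hypothetical values). The per-column count at the end is an extra that the exercise does not ask for, but it often helps when deciding how to treat the gaps.

```python
import numpy as np
import pandas as pd

# Made-up sample containing three missing values.
df = pd.DataFrame({
    "Borough": ["K", None, "X"],
    "Register": [500.0, np.nan, 300.0],
    "Major N": [0.0, 1.0, np.nan],
})

has_na = df.isnull().values.any()    # True if at least one cell is missing
total_na = df.isnull().values.sum()  # number of missing cells in the whole df
na_per_column = df.isnull().sum()    # number of missing cells per column

print(has_na, total_na)
print(na_per_column)
```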

19392 of our values are Na. Wow, that's a lot! In that case, we have 3 options:
<br>
1) We can get rid of them with `ss1517.dropna()`. We could do this, but we would also lose a lot of useful information, because by default `.dropna()` __drops every row that contains at least one Na value__, even when the other values in that row are valid. That's why we won't go with `dropna()`.
<br>
2) We can use the `value` parameter of the `fillna()` function. With a single value, every Na is replaced by that same value. If we fill with an int or float, we also overwrite the Na values in the object columns, and vice versa. That seems a little messy. <br>
3) We can use the `method` parameter of the `fillna()` function. Setting `method='ffill'` replaces each Na value with the last valid observation, and `method='bfill'` replaces it with the next valid observation. If we apply `method='ffill'` first and then `method='bfill'`, every Na value is filled with a value from its own column, so each column keeps its dtype. That way we preserve our dataframe's structure, which is why we'll go with this option.

* Fill the NA values or drop them with the relevant method(s). Briefly explain why you chose that particular method and why you did not choose the others.
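
The options discussed above can be compared on a small made-up dataframe (hypothetical values). This sketch uses the `ffill()` and `bfill()` methods, which do the same job as `fillna(method='ffill')` and `fillna(method='bfill')`; recent pandas versions deprecate the `method` argument of `fillna()`, so the dedicated methods are the safer spelling.

```python
import numpy as np
import pandas as pd

# Made-up sample: gaps at the start and in the middle of each column.
df = pd.DataFrame({
    "Borough": [None, "K", None, "X"],
    "Register": [np.nan, 500.0, np.nan, 300.0],
})

# Option 1: dropna() removes every row that has at least one missing value.
dropped = df.dropna()

# Option 3: forward fill first, then backward fill for leading gaps
# that have no earlier valid value to copy from.
filled = df.ffill().bfill()

print(dropped.shape)
print(filled)
print(filled.dtypes)
```

Forward fill alone would leave the first row empty, which is why the backward fill is applied afterwards. A column that is entirely Na would still stay empty with this approach.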

Let's check for Na values once more.

* Recheck whether you have NA values in `ss1517` or not with `isnull().values.any()`

* Take a look at your data with `.info()` and evaluate your data within `ss1517`

Great ! We don't have any Na values. 
___

Let's check whether anything else changed after we applied `fillna()`:
* Use `info` for that purpose

Great ! Everything seems to be in order.
* Print the first 3 rows of your df.

How beautiful our dataset looks without Na values!

### STEP 3: ANALYSING DATA

__In this dataset, major crimes are coded like this:__<br>
Burglary - 0<br>
Grand larceny - 1<br>
Grand larceny auto - 2<br>
Murder - 3<br>
Rape - 4 <br>
Robbery - 5<br>
Felony - 6<br>
Assault - 8<br>
___
Let's check how the major crimes are distributed in the Number of Major Crimes column, a.k.a. `Major N`.
* For that purpose, you can use `unique()` method.
* Name your result as `avgmajor`
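
A small sketch of this step on a made-up `Major N` column (hypothetical values). `value_counts()` is an addition of this sketch: `unique()` only lists the distinct values, while `value_counts()` also shows how often each one occurs.

```python
import pandas as pd

# Made-up stand-in for the "Major N" column of ss1517.
major = pd.Series([0.0, 1.0, 0.0, 0.0, 2.0, 1.0, 0.0], name="Major N")

# Distinct values, in order of first appearance.
avgmajor = major.unique()

# How many rows carry each value, most frequent first.
counts = major.value_counts()

print(avgmajor)
print(counts)
```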

Though we have 7 different crimes, Burglary (0) and Grand Larceny (1) are so dominant that the other crimes barely show up in `Major N`. This will become clearer once we plot the `Major N` (Number of Major Crimes) column.
___

Let's separate these two into their own dataframe, named `bigtwo`, and dive deeper. <br>
Hint: You can use the following structure: `df[df["Column"] <= yourfilter]`

### BIG TWO

In [10]:
bigtwo = ss1517[ss1517["Major N"] <= 1]

Let's examine our big two.
* Start with `shape`, then proceed with `info()`
* You can also print the first five rows for that purpose.

There might be a relationship between the two big crimes and the borough in which they are committed.
___
* To figure out relevance, we need to know our Borough values. Let's find the unique values in Borough column by `unique()`function. <br>
* Use `unique()` and return the unique values of `Borough` column of `bigtwo`


In that case : <br>
__M__ represents Manhattan. <br>
__Q__ represents Queens. <br>
__R__ represents Rikers Island. <br>
__K__ represents Brooklyn. <br>
__X__ represents The Bronx. <br>
__O__ represents Staten Island. <br>

* Now that we know the borough codes and their actual names, we can examine the relationship between `bigtwo` and the boroughs. We can do this by grouping the rows by borough. The handiest tool for that is the `groupby()` function.
* Use `groupby()` and group `bigtwo` by its `Borough`
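
One possible way to do the grouping is sketched below on made-up rows (hypothetical values). Counting rows per borough is an assumption of this sketch; another aggregation such as `sum()` or `mean()` may suit the question better.

```python
import pandas as pd

# Made-up stand-in for bigtwo.
bigtwo = pd.DataFrame({
    "Borough": ["K", "K", "M", "X", "X", "K"],
    "Major N": [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

# Distinct borough codes present in the data.
print(bigtwo["Borough"].unique())

# Number of rows per borough.
per_borough = bigtwo.groupby("Borough")["Major N"].count()
print(per_borough.sort_values(ascending=False))
```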

* Evaluate the result **in your own words**. Remember, Data Science is all about explaining the story of the data. So, try your best and seize the story behind the data.

Well, it seems that burglary and grand larceny are committed most often in Brooklyn, then the Bronx, then Manhattan. We can say that burglary and grand larceny in NYC public schools do not appear to depend on wealth: Manhattan is the richest borough of NYC, yet more of these crimes are committed there than in the poorest borough, the Bronx.
___
Hmm, how about adding a new variable to our equation and looking from a different perspective? <br>
Let's examine our big two by borough and by the number of students in each school, a.k.a. `Register` <br>
That way, we can evaluate `bigtwo` not only against the boroughs and their wealth, but also against the boroughs' student populations and their effect on burglary and grand larceny.
___
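
The notebook does not show how this comparison was computed, so the following is only one plausible sketch: group by both `Borough` and `Major N` and add up `Register`. The rows are made up (hypothetical values).

```python
import pandas as pd

# Made-up stand-in for bigtwo with a Register column.
bigtwo = pd.DataFrame({
    "Borough": ["K", "K", "M", "M", "X"],
    "Major N": [0.0, 1.0, 1.0, 1.0, 0.0],
    "Register": [500.0, 300.0, 700.0, 200.0, 400.0],
})

# Total number of registered students per borough and Major N value.
register_by_group = bigtwo.groupby(["Borough", "Major N"])["Register"].sum()
print(register_by_group)

# Same numbers as a table: boroughs as rows, Major N values as columns.
print(register_by_group.unstack())
```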


Now we can say that far more grand larceny than burglary is committed in Manhattan. For the other boroughs, the picture is pretty much the same.

### OTHER MAJOR CRIMES

Now, it is time to talk about the other major crimes : <br>
Grand larceny auto - 2<br>
Murder - 3<br>
Rape - 4 <br>
Robbery - 5<br>
Felony - 6<br>
Assault - 8<br>
___
Although they do not occur as frequently as burglary and grand larceny, their nature is much more serious than that of the big two. Let's create a new dataframe and name it `othercrimes`:
* Hint: You can use the following structure: `df[df["Column"] > yourfilter]`
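
Since `bigtwo` keeps the rows where `Major N` is at most 1, the other crimes are the complement of that filter. A minimal sketch on made-up rows (hypothetical values):

```python
import pandas as pd

# Made-up stand-in for ss1517.
ss1517 = pd.DataFrame({
    "Borough": ["K", "M", "X", "Q", "K", "M"],
    "Major N": [0.0, 1.0, 2.0, 5.0, 1.0, 3.0],
})

# Keep only the rows whose Major N value is greater than 1.
othercrimes = ss1517[ss1517["Major N"] > 1]

print(othercrimes.shape)
print(othercrimes.head())
```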

Let's examine the other crimes: 
* Start with `shape`, then proceed with `info()`
* You can also print the first five rows for that purpose.

* Print the first 3 rows of `othercrimes` df. 

It seems that we won't use some of the columns in our dataset. Let's get rid of them. <br>
We can do this with the `drop()` function of pandas, which can remove columns from a dataframe.
___
* After careful review, your team lead thinks that `Location Code` column is unnecessary for our analysis.
* Drop the `Location Code` column from `othercrimes` df.
* After that, print the first 5 rows to check whether `Location Code` is dropped or not. 
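
A short sketch of the drop on made-up rows (hypothetical values). `drop()` returns a new dataframe, so the result has to be assigned back.

```python
import pandas as pd

# Made-up stand-in for othercrimes.
othercrimes = pd.DataFrame({
    "Location Code": ["K001", "M002", "X003"],
    "Borough": ["K", "M", "X"],
    "Major N": [2.0, 5.0, 3.0],
})

# Remove the column and keep the result under the same name.
othercrimes = othercrimes.drop(columns=["Location Code"])

print(othercrimes.head())
```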

Great! To make the analysis smoother, we'll only need 3 columns: <br>
`# Schools` : Number of schools in the building. We are going to need this column because different schools mean different population categories, such as age and culture. These categories might be related to the major crimes in a school. <br>
`EnGroupA`: Building population. <br>
`Major N`: Other major crimes.

To work with only these 3 columns, we need to reshape our df. We can do this with `loc` and `iloc`.
* Update `othercrimes` so that it has only the `# Schools`, `EnGroupA`, and `Major N` columns, nothing more. Use `loc` or `iloc` for this purpose.
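
With `loc`, the selection can be written by column name, as in this sketch on made-up rows (hypothetical values; the column names are assumed to match the real headers exactly).

```python
import pandas as pd

# Made-up stand-in for othercrimes with one extra column.
othercrimes = pd.DataFrame({
    "Borough": ["K", "M", "X"],
    "# Schools": [1.0, 2.0, 1.0],
    "EnGroupA": ["A", "B", "A"],
    "Major N": [5.0, 2.0, 3.0],
})

# All rows (":"), only the three named columns.
othercrimes = othercrimes.loc[:, ["# Schools", "EnGroupA", "Major N"]]

print(othercrimes.head())
```

`iloc` would do the same with integer positions, but the positions depend on the column order of the real file, so selecting by name is less fragile.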

* Print the first 5 rows of your df's updated version.

* Let's sort the values by Major N codes to see the relevance better.
* Use `sort_values()` for that purpose.
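
A brief sketch of the sort on made-up rows (hypothetical values):

```python
import pandas as pd

# Made-up stand-in for the three-column othercrimes.
othercrimes = pd.DataFrame({
    "# Schools": [1.0, 2.0, 1.0],
    "EnGroupA": ["A", "B", "A"],
    "Major N": [5.0, 2.0, 3.0],
})

# Ascending by default; pass ascending=False for the reverse order.
othercrimes = othercrimes.sort_values(by="Major N")

print(othercrimes)
```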

* Great! To see the effects of Building Population and Number of Schools in the Building on Other Major Crimes, let's pivot the data. We can do this with the `pivot_table` function of pandas.
* Use `pivot_table` as follows:
    * For index, use `EnGroupA`,
    * For columns, use `# Schools`
    * For values, use `Major N`

We have Na values again. Replace them as we did earlier, with the `ffill` and `bfill` options of `fillna()`.
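
The pivot and the fill could be sketched as follows on made-up rows (hypothetical values and group labels). `pivot_table` averages the values of each cell by default, and cells with no matching rows come out as Na. The name `pivot` is a choice of this sketch.

```python
import pandas as pd

# Made-up stand-in for the three-column othercrimes.
othercrimes = pd.DataFrame({
    "EnGroupA": ["A", "A", "B", "B"],
    "# Schools": [1, 2, 1, 1],
    "Major N": [2.0, 3.0, 4.0, 6.0],
})

# Rows: building population group; columns: number of schools;
# cells: mean of Major N (the default aggregation).
pivot = othercrimes.pivot_table(index="EnGroupA", columns="# Schools", values="Major N")
print(pivot)

# Group B has no building with 2 schools, so that cell is missing.
# Forward fill, then backward fill, as in Step 2.
pivot_filled = pivot.ffill().bfill()
print(pivot_filled)
```

Keep in mind that the filled cells are copied from neighbouring groups, so they are placeholders and not observed averages.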

Print the first 10 columns of `pivot_table` **df**.


* Evaluate your `pivot_table` df and final work with your own words.