We'll expand on our Python data cleaning, analysis, and visualization skills by working with [Baltimore City Police Department data]() from the Baltimore City Open Data portal.

## Import Packages

In [182]:
# import packages to conduct data analysis and to make interactive charts and maps


## Import Data

[Previously](https://colab.research.google.com/drive/14PJV4aPg01xX7-XnIUtrC6_GqgeriZj1) we imported data from a link to a CSV uploaded to GitHub. Here, we'll load data from a file on our computer. This might be useful if we don't have a GitHub profile or if we have a file that:
 - Is larger than 25 MB, which is too big to upload through GitHub's web interface
 - We don't want to convert to a CSV (e.g. it is part of an Excel analysis we've already done)
 - Is proprietary or not our own data, so we don't want to upload it even to a private GitHub repository
 



### Importing a Data File from Your Local Machine 

Similar to saving the website link as a __variable__, here we'll save the path to our file as a variable. If we move the data file into the same folder as our JupyterLab notebook, the file name alone is enough. If the data file lives somewhere else on our machine and we don't want to move it, we'll need to give the entire path name.

We can get the path name for our file by: 
 - __MacOS:__ CONTROL + click the file and, while the menu is open, press the OPTION key. The menu options should change, and you should see an option that says "Copy "filename.csv" as Pathname." Click this to copy the file path name. 
 - __Windows:__ SHIFT + right click the file and choose "Copy as Path"

In [183]:
# save file name as a variable with the file name in ""


In [184]:
# import data as a dataframe to manipulate
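The two cells above could be filled in along these lines. This is only a sketch: the file name `bpd_arrests_sample.csv` and its columns are hypothetical stand-ins, and the example writes a tiny CSV to a temporary folder so that it runs anywhere. With a real file, `file_path` would simply hold the copied path name.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for a downloaded file: write a tiny CSV to a temp folder.
# With your own data, set file_path to the copied path instead,
# e.g. "my_file.csv" or, on Windows, a raw string like r"C:\folder\my_file.csv".
demo_dir = Path(tempfile.mkdtemp())
file_path = demo_dir / "bpd_arrests_sample.csv"
file_path.write_text(
    "ArrestID,ArrestDate,ArrestTime\n"
    "1,01/05/2019,08:30\n"
    "2,03/17/2020,23:10\n"
)

# import data as a dataframe to manipulate
df_bpd = pd.read_csv(file_path)

# preview the first rows
print(df_bpd.head())
```

For an Excel workbook, `pd.read_excel` plays the same role as `pd.read_csv`.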


In [222]:
# preview the data


## Inspecting Data

Now we can see what our data looks like, but we don't know much about what's happening within the dataset and what we can and can't potentially do with this. There are a few functions that we can use to gain some high-level insights into what our data has.

### General Big-Picture Counts and Stats 

In [223]:
# to look at the stats on our numeric columns, we can use df.describe() to give us
# the count, mean, maximum, minimum, standard deviation, and percentiles within those columns



In [224]:
# we can also use df.info() to see the data types within columns
# this will be important if some of our "number" columns aren't actually "numbers"
# or if we need to convert columns to dates or times
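As a small illustration of both inspection calls, here is a sketch on a made-up dataframe; the `Age` column and its values are assumptions for the demo, not taken from the BPD file.

```python
import pandas as pd

# Made-up data: one numeric column and one column of dates stored as text
df_demo = pd.DataFrame({
    "Age": [23, 35, 41],
    "ArrestDate": ["01/05/2019", "03/17/2020", "07/04/2021"],
})

# describe() summarizes only the numeric columns by default
summary = df_demo.describe()
print(summary)

# info() prints each column's data type and non-null count
df_demo.info()
```

Because `ArrestDate` is stored as text, it is skipped by `describe()`, which is a quick hint that the column still needs converting.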


## Data Cleaning 

There are a few __data types__ that we'll need to be concerned with in our analysis for this class: 

 - __object or string (str)__: these are any combination of letters, numbers, and characters treated as a single piece of text. We can manipulate strings by splitting on specific characters, joining them together, grouping by similarities, etc.
 - __integer (int)__: these are whole numbers. We can perform any arithmetic with these as long as the other values in the equation are also integers or...
 - __float (float)__: these are numbers that have a decimal in them. Similar to integers, we can perform any arithmetic on these values
 - __datetime (datetime)__: these are dates, times, or both dates and times. It's advantageous to convert actual date/time data into a datetime format so that we can perform arithmetic on these values (e.g. subtracting dates to get the number of days in between, adding times to get an end time, etc.)
 - __Boolean (bool)__: these are data types that are either True or False. We use boolean data types a lot in Python logic expressions
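The types listed above can be tried out in a few lines. This is a minimal sketch with made-up values, meant only to show one typical operation per type.

```python
import pandas as pd

# str: split on a specific character, or join pieces together
date_parts = "01/05/2019".split("/")
place = "Baltimore" + " " + "City"

# int and float: mixing the two in arithmetic gives a float
total = 3 + 1.5

# datetime: subtracting two dates gives the time between them
days_between = (pd.Timestamp("2020-01-31") - pd.Timestamp("2020-01-01")).days

# bool: a logic expression evaluates to True or False
is_after_noon = 14 > 12

print(date_parts, place, total, days_between, is_after_noon)
```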

Most of our numerical data in our BPD dataset is already classified as an int, but our date and time values are classified as "objects." We can convert our date and time columns into date/time values by "redefining" them.

In [225]:
# remove all non-numerical values from ArrestDate and ArrestTime
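One possible reading of this step, sketched on made-up values: assume stray letters or spaces have crept into the two columns, and keep only the digits plus the separator each column is expected to use (`/` for dates, `:` for times). The sample values and the choice of separators are assumptions; the real file may need a different pattern.

```python
import pandas as pd

# Made-up rows with a leading space, a stray letter, and an "am" suffix
df_demo = pd.DataFrame({
    "ArrestDate": ["01/05/2019", " 03/17/2020", "07/04/2021x"],
    "ArrestTime": ["08:30", "23:10 ", "9:05am"],
})

# Drop every character that is not a digit or the expected separator
df_demo["ArrestDate"] = df_demo["ArrestDate"].str.replace(r"[^0-9/]", "", regex=True)
df_demo["ArrestTime"] = df_demo["ArrestTime"].str.replace(r"[^0-9:]", "", regex=True)

print(df_demo)
```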


Next, we'll create a new column that combines the arrest date and time so that we can convert this to a datetime datatype and manipulate the data.

In [189]:
# combine arrest date and time columns


In [226]:
# preview the new column


In [227]:
# convert ArrestDateTime column to datetime
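Combining and converting might look like the sketch below. It assumes dates shaped like `MM/DD/YYYY` and times shaped like `HH:MM`; those formats and the sample rows are assumptions, so the `format` string would need adjusting to match the real file.

```python
import pandas as pd

# Made-up rows; the last time value is deliberately unusable
df_demo = pd.DataFrame({
    "ArrestDate": ["01/05/2019", "03/17/2020", "07/04/2021"],
    "ArrestTime": ["08:30", "23:10", "unknown"],
})

# combine arrest date and time columns into one text column
df_demo["ArrestDateTime"] = df_demo["ArrestDate"] + " " + df_demo["ArrestTime"]

# convert to datetime; errors="coerce" turns unparseable values into NaT
df_demo["ArrestDateTime"] = pd.to_datetime(
    df_demo["ArrestDateTime"], format="%m/%d/%Y %H:%M", errors="coerce"
)

# check that the column is now a datetime type
print(df_demo.dtypes)
```

Using `errors="coerce"` keeps the conversion from stopping on a single bad row, at the cost of silently producing missing values, so it is worth counting the `NaT` entries afterwards.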



In [228]:
# now check to see that we converted our column to a datetime format


In [229]:
# preview data


## Data Manipulation

Now that we have our data set up with the correct data types, we can start to dive deep and aggregate this information to better understand what's happening with the Baltimore City Police Department and the arrests over time. 

#### Arrest Year Column 

Let's look at how a few of these column values change each year. To make this easier, we'll create a new column for the _ArrestYear_ from the ArrestDate column, and then aggregate some variables by the year column. 

In [230]:
# make a new column for the ArrestDate year


We may also want to look at arrests by month or day of the week, so we'll add columns for these as well.

In [231]:
# make columns to define the month and day of the week

# Monday = 0, Sunday = 6
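The year, month, and weekday columns can all be pulled from a datetime column with the `.dt` accessor. In this sketch the column names `ArrestMonth` and `ArrestDayOfWeek` are hypothetical, and the dates are made up.

```python
import pandas as pd

# Made-up arrests: a Saturday in 2019 and a Tuesday in 2020
df_demo = pd.DataFrame({
    "ArrestDateTime": pd.to_datetime(["2019-01-05 08:30", "2020-03-17 23:10"]),
})

# make a new column for the arrest year
df_demo["ArrestYear"] = df_demo["ArrestDateTime"].dt.year

# make columns for the month (1-12) and day of the week (Monday = 0, Sunday = 6)
df_demo["ArrestMonth"] = df_demo["ArrestDateTime"].dt.month
df_demo["ArrestDayOfWeek"] = df_demo["ArrestDateTime"].dt.dayofweek

print(df_demo)
```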

In [232]:
# check new columns


#### How has the number of arrests in the morning, daytime, and night changed over the years?

We'll define morning, daytime, and night as: 
 - __morning__: 12 AM-8 AM
 - __daytime__: 8 AM-4 PM
 - __night__: 4 PM-12 AM

and categorize each arrest automatically by separating the hours of the day into three equal bins.

In [233]:
# first, make a new column that extracts the arrest time hour


In [234]:
# separate our ArrestHour column into bins
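A sketch of the hour extraction and binning using `pd.cut`. The column name `DaySegment` is hypothetical and the timestamps are made up; the bin edges follow the three 8-hour segments defined above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "ArrestDateTime": pd.to_datetime([
        "2019-01-05 00:15", "2019-01-05 07:59", "2019-01-05 08:00",
        "2019-01-05 15:30", "2019-01-05 16:00", "2019-01-05 23:45",
    ]),
})

# first, make a new column that extracts the arrest time hour (0-23)
df_demo["ArrestHour"] = df_demo["ArrestDateTime"].dt.hour

# right=False makes each bin include its left edge: [0, 8), [8, 16), [16, 24)
df_demo["DaySegment"] = pd.cut(
    df_demo["ArrestHour"],
    bins=[0, 8, 16, 24],
    right=False,
    labels=["morning", "daytime", "night"],
)

print(df_demo)
```

Without `right=False`, hour 0 would fall outside the first bin and hour 8 would be labeled as morning, so the edge handling matters here.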


In [235]:
# preview new column

Now, we want to create an aggregated table of the number of arrests per day "segment" for each year. Previously we did this by using the Python version of a pivot table. Here, we'll use the pandas `groupby` method to group our column values in a similar way.

The general formula for a pandas groupby function is: 

```
new_df = df.groupby("columns you want to group by")["columns you want as the values/what you will perform functions on"].agg([calculation_you_want_to_perform])
```

Here, we'll aggregate/group the dataframe by arrest year __and__ day segment and count the number of arrests in each segment by using the arrest ID as the unique identifier to count values.

In [236]:
# make an aggregated dataframe to look at the number of arrests in each day segment over the years available
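Applied to a small made-up table, the groupby pattern could look like this. The column names `ArrestID` and `DaySegment` are hypothetical; the real identifier column may be named differently.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "ArrestID": [1, 2, 3, 4, 5],
    "ArrestYear": [2019, 2019, 2019, 2020, 2020],
    "DaySegment": ["morning", "night", "night", "daytime", "daytime"],
})

# group by year and day segment, then count arrest IDs in each group
df_daysegment = (
    df_demo.groupby(["ArrestYear", "DaySegment"])["ArrestID"]
    .agg(["count"])
    .reset_index()
)

print(df_daysegment)
```

Calling `reset_index()` turns the group keys back into ordinary columns, which is the shape Plotly Express expects.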



In [237]:
# preview new dataframe


## Data Visualizations 

Here, we'll make a few interactive charts with Plotly Express to look at this distribution. The general template for a line chart is:

```
line_timeofday_arrests = px.line(df,
                                x = "",
                                y = "",
                                color = "",
                                hover_name = "",
                                title = "",
                                labels = {"": "", "": ""},
                                )
```

#### Line Graph to Compare Number of Arrests per year 

In [238]:
# line graph of the number of arrests in each time period



In [239]:
# view line graph


In [207]:
# export the visual to an HTML file to share
line_timeofday_arrests.write_html("line_graph_bpd_arrest_dayperiod.html")

#### Pie Chart to compare the distribution of arrests throughout the week 

In [210]:
# make a pie chart to show distribution of arrests during the week
week_dist_arrests = px.pie(df_bpd, 
                           values="", 
                           names="")

In [240]:
# show pie chart


#### Plotly Animation to show changes in number of arrests over years

In [214]:
# aggregate data to count the number of arrests per weekday per year


In [241]:
# preview new df


In [242]:
# look up the maximum value for number of arrests to add into the animation 


In [243]:
# make an animated bar chart to show changes in the number of arrests on each weekday over the years provided


In [244]:
# view animation


In [221]:
# export the animation to an HTML file to share
animation_bar_arrests.write_html("weekday_arrest_trend_animation.html")