# Demo 1: Pandas and Visualizations

## Install and Load Libraries

We will need to install the pandas package. We can do this within the Anaconda Navigator, through the command prompt/terminal, or from within Jupyter Notebook itself:

In [None]:
# !pip install pandas

Now let's import it.

In [None]:
import pandas as pd

## Series vs DataFrame

Before jumping into DataFrames, let's look at what a basic pandas Series looks like. It is one-dimensional:

In [None]:
series = pd.Series(['dog', 'cat', 'frog'])

In [None]:
series

In [None]:
colors = pd.Series(['blue','black','green'])
colors

We have an index (0, 1, 2) and the elements (dog, cat, frog).  
Now if we make a DataFrame, it is two-dimensional:

In [None]:
animal_df = pd.DataFrame({"Animal": series, "Colors": colors})

In [None]:
animal_df

Here, we see more of a table format. This is what most of the data we work with will look like (but larger!).
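
To make the one-dimensional versus two-dimensional distinction concrete, here is a minimal sketch that rebuilds the same toy animal data and inspects a few attributes. The variable names are only for illustration.

```python
import pandas as pd

# Toy data mirroring the animal example above
animals = pd.Series(["dog", "cat", "frog"])
colors = pd.Series(["blue", "black", "green"])

# A Series has a single axis: an index plus the values
print(list(animals.index))  # the row labels
print(animals.ndim)         # number of dimensions of a Series

# A DataFrame lines up several Series on a shared index
animal_df = pd.DataFrame({"Animal": animals, "Colors": colors})
print(animal_df.shape)      # (rows, columns)
print(animal_df.ndim)       # number of dimensions of a DataFrame

# Selecting a single column hands back a Series again
animal_col = animal_df["Animal"]
print(type(animal_col).__name__)
```

One way to think about it: each column of a DataFrame is itself a Series, and all columns share the same index.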

## Reading in our data!

If we are ever unsure about what a method is for or how to use it, Jupyter Notebook lets us pull up the documentation by adding a `?` at the end of its name. Try looking at the documentation for `pd.read_csv`.

This data is obtained from https://www.kaggle.com/datasets/jgiigii/uscrimesdataset

In [None]:
# pd.read_csv?

In [None]:
df = pd.read_csv("Crime.csv")
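
`pd.read_csv` has many optional parameters. The sketch below tries two of them, `usecols` and `nrows`, on a tiny made-up CSV held in memory so that it runs without `Crime.csv`. The `Zip Code` and `Victims` columns and all the values are assumptions for illustration, not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real Crime.csv has many more columns
csv_text = """City,Zip Code,Victims
SILVER SPRING,20910,1
ROCKVILLE,20850,2
BETHESDA,20814,1
"""

# read_csv accepts a file path, a URL, or any file-like object
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["City", "Victims"],  # load only the columns we need
    nrows=2,                      # read only the first two data rows
)
print(small_df)
```

Limiting columns and rows like this can be handy for taking a quick peek at a large file before loading all of it.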

It is also possible to read in a csv from a URL rather than a local file. For example:

In [None]:
pd.read_csv("https://raw.githubusercontent.com/GoogleTrends/data/master/20170508_HealthCareSearchesUS.csv")

But for now, we will use our crime dataset.

## Describe our data

This will get us a list of our columns (aka our features).

In [None]:
df.columns

If we want to see our features *and* their data types, we can use the following:

In [None]:
df.dtypes

This gives information on the index values.

In [None]:
df.index

This gives the 'shape' of our data. The first value is the number of rows (how many items we have). The second value is the number of columns (how many features we have).

In [None]:
df.shape

The following gives a quick overview of our data: the count, mean, standard deviation, etc. of each column. Notice how we only see information on 7 columns when we have 30 total. By default, `describe()` only summarizes the numerical columns.

In [None]:
df.describe()
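
If we also want a summary of the non-numerical columns, `describe()` takes an `include` argument. Here is a small sketch on a made-up DataFrame. The `Victims` column and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    "City": ["ROCKVILLE", "BETHESDA", "ROCKVILLE"],
    "Victims": [1, 2, 1],
})

# Default behavior: only the numeric column is summarized
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" adds text columns, with rows such as unique, top, and freq
all_summary = toy.describe(include="all")
print(all_summary)
```

For text columns, `top` is the most frequent value and `freq` is how often it appears. The numeric rows (mean, std, ...) are left empty for those columns.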

## How to view and select data

To start, the below allows us to quickly view the first five entries.

In [None]:
df.head()

Let's say we only want to view three:

In [None]:
df.head(3)

Or we only want to view the last five:

In [None]:
df.tail()

### The difference between `.loc` and `.iloc`

Let's say for some reason, our indices are not in order from zero onwards, but are more scattered.

In [None]:
animals = pd.Series(['cat', 'dog', 'bird', 'frog'], index = [0, 3, 9, 8])

`.loc` will get us the item with the **index label 3**.

In [None]:
animals.loc[3]

`.iloc` will get us the item at the **position of 3** regardless of index values.

In [None]:
animals.iloc[3]

To select multiple items, we can slice (this is very similar to the practice you all got with lists):

In [None]:
animals.iloc[1:]

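What happens if we try to use `animals.loc[1:]`? Give it a try. Keep in mind that `.loc` slices by label and, unlike `.iloc`, includes both endpoints, as we can see on our DataFrame: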

In [None]:
df.loc[2:5]
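
Here is a small sketch that puts the two kinds of slicing side by side on the scattered-index `animals` Series, and also tries the `animals.loc[1:]` question from above. The variable names are just for illustration.

```python
import pandas as pd

animals = pd.Series(["cat", "dog", "bird", "frog"], index=[0, 3, 9, 8])

# .iloc slices by position and excludes the stop position, like a list
by_position = animals.iloc[1:3]   # positions 1 and 2

# .loc slices by label and includes both endpoints
by_label = animals.loc[3:8]       # from label 3 through label 8, in stored order

# Label 1 does not exist and the index is not sorted, so pandas
# cannot work out where the slice should start
try:
    animals.loc[1:]
    slice_failed = False
except KeyError:
    slice_failed = True

print(by_position.tolist())
print(by_label.tolist())
print(slice_failed)
```

The takeaway: `.iloc` behaves like list slicing, while `.loc` depends entirely on the labels that actually exist in the index.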

### Viewing a specific column

In [None]:
df['City']

### How do we filter our data?

In [None]:
df.columns

In [None]:
df['Crime Name1'].unique()

In [None]:
df_crimes = df[df['Crime Name1'] != 'Not a Crime']

In [None]:
df_crimes

What if we want to see the counts per `Crime Name1`? 

In [None]:
df_crimes.groupby(['Crime Name1']).size()
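
To see what is going on in the filter above, here is a minimal sketch on a made-up four-row DataFrame. It splits the filter into two steps and shows `value_counts()` as another way to get counts per category.

```python
import pandas as pd

# Hypothetical toy data using category names like the ones in the dataset
toy = pd.DataFrame({
    "Crime Name1": [
        "Crime Against Property",
        "Not a Crime",
        "Crime Against Person",
        "Crime Against Property",
    ],
})

# The comparison produces a boolean Series (a "mask"), one value per row
mask = toy["Crime Name1"] != "Not a Crime"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
toy_crimes = toy[mask]

# Two ways to count rows per category
by_group = toy_crimes.groupby("Crime Name1").size()
by_value = toy_crimes["Crime Name1"].value_counts()
print(by_group)
print(by_value)
```

`groupby(...).size()` orders the result by category name, while `value_counts()` orders it from most to least frequent.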

## Manipulating data

#### Editing string values

Let's say we need to standardize our data and have everything in `Crime Name1` be in lowercase.

In [None]:
df['Crime Name1'] = df['Crime Name1'].str.lower()

In [None]:
df.head()
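
Why bother standardizing? Here is a small sketch with made-up, inconsistently typed values. It chains `.str.strip()` (an extra step not used above) with `.str.lower()` so that values differing only in case or stray spaces collapse into one category.

```python
import pandas as pd

# Hypothetical messy values: the same categories typed inconsistently
names = pd.Series([
    "Crime Against Property ",
    "crime against property",
    "CRIME AGAINST PERSON",
])
print(names.nunique())      # distinct values before cleaning

# .str methods apply to every element; they can be chained
cleaned = names.str.strip().str.lower()
print(cleaned.nunique())    # distinct values after cleaning
print(cleaned.tolist())
```

Without this kind of cleanup, a `groupby` would treat each spelling as its own group.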

#### Dealing with null values

First of all, let's take a look at the different columns that have null values. If an item is missing a particular column value, what could that mean? For example, is it possible that a particular zip code was not recorded properly? Does a missing end date/time mean that someone forgot to record it, or that the information was never known to begin with?

In [None]:
df.isnull().sum()

How we deal with nulls depends on the context of the data and the questions we are trying to answer. We could simply drop every row that contains a null using `df.dropna()`. But what if one column has a great many nulls and we aren't planning to use that column anyway? It may be better to drop that individual column instead, for example: `df.drop('End_Date_Time', axis=1)`. Note that `axis=0` (the default) drops rows, and `axis=1` drops column(s). If you ever forget which is which, try looking at the documentation using `df.drop?`.

Notice how this drops literally everything: `dropna()` removes every row that has at least one null value.

In [None]:
df.dropna()

If we run `df` again, we see that nothing was actually dropped: `dropna()` returns a new DataFrame and leaves the original untouched. If we wanted to drop values *and* update `df`, we would need to use either `df = df.dropna()` or `df.dropna(inplace=True)`.

In [None]:
df.drop('End_Date_Time', axis = 1, inplace=True)

In [None]:
df
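
Here is a minimal sketch of these options on a made-up three-row DataFrame. The values are assumptions for illustration, and `subset=` is an extra `dropna` option not used above: it only looks for nulls in the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is entirely null, another has one null
toy = pd.DataFrame({
    "City": ["ROCKVILLE", None, "BETHESDA"],
    "End_Date_Time": [np.nan, np.nan, np.nan],
    "Victims": [1, 2, 1],
})

null_counts = toy.isnull().sum()
print(null_counts)

# Every row has at least one null, so a plain dropna() leaves nothing
all_dropped = toy.dropna()

# Dropping the mostly-empty column first keeps far more rows
without_col = toy.drop("End_Date_Time", axis=1)
cleaned = without_col.dropna()

# subset= only checks the listed columns for nulls
subset_only = toy.dropna(subset=["City"])

# None of the calls above modified toy itself
print(toy.shape, all_dropped.shape, cleaned.shape, subset_only.shape)
```

Which option is right depends on the question: dropping a column loses a feature, while dropping rows loses observations.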

## Visualizing with Matplotlib!

We will explore two libraries for making visualizations in Python. One is Matplotlib and the other is Seaborn. To start, let's install and import Matplotlib:

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

Take a look at the documentation first.

In [None]:
df.plot?

#### Crime counts

Remember how we used `groupby` earlier?

In [None]:
df.groupby('Crime Name1').size()

In [None]:
df.groupby('Crime Name1').size().plot(kind='bar')

Now perhaps we want to dig deeper into crimes against property. First, let's filter for just those. (Remember that we lowercased `Crime Name1` earlier, so we match the lowercase value.)

In [None]:
df_property_crimes = df[df['Crime Name1'] == 'crime against property']

Now we can group by `Crime Name2` and plot that!

In [None]:
df_property_crimes.groupby('Crime Name2').size().plot(kind='bar')
plt.xlabel('Crime Names')

This is a bit small, so let's increase the size.

In [None]:
plt.rcParams['figure.figsize'] = (12,6)

In [None]:
df_property_crimes.groupby('Crime Name2').size().plot(kind='bar')
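
As one more thing to experiment with, here is a sketch that sorts the counts and labels both axes before plotting. The data is made up (only `Robbery` is a category name borrowed from the dataset), and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the property-crime subset
toy = pd.DataFrame({
    "Crime Name2": ["Theft", "Burglary", "Theft", "Robbery", "Theft", "Burglary"],
})

# Sorting the counts first makes the bars easier to compare
counts = toy.groupby("Crime Name2").size().sort_values()

# Horizontal bars leave more room for long category names
ax = counts.plot(kind="barh", figsize=(8, 4))
ax.set_xlabel("Number of incidents")
ax.set_ylabel("Crime Name2")
ax.set_title("Incidents per category (toy data)")
plt.tight_layout()
```

Passing `figsize` to `.plot()` sizes just this one figure, whereas `plt.rcParams['figure.figsize']` changes the default for every figure that follows.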

This is just *one small example* of things Matplotlib can do. It will be important to look through documentation, experiment, and ask questions! It is OK (and encouraged) to look things up (just make sure you cite your resources)!

## Some visualizations with Seaborn now!  
More info here: https://seaborn.pydata.org/  
They have a gallery and various tutorials.

Once again, let's install and import everything first.

In [None]:
!pip install seaborn

In [None]:
import seaborn as sns

#### We want to explore the crime rate across different times.

Let's make a new column for the date, and then for month, day, and hour.

In [None]:
df['date'] = pd.to_datetime(df['Dispatch Date / Time'])

The above will make it easier to parse the month, day, and hour now. Also, here is a reference [link](https://www.kaggle.com/code/sady36/visualization-uscrimesdataset). (I admit, I was about to do this in a much slower and more painful way before seeing this notebook. So again, don't be afraid to look things up and use your resources!)   
By the way, all of this below counts as pre-processing. We are making new columns based on the original `Dispatch Date / Time` one!

In [None]:
# The .dt accessor applies datetime methods to the whole column at once
df["month"] = df["date"].dt.month_name()
df["day_name"] = df["date"].dt.day_name()
df["hour"] = df["date"].dt.hour

Look how nice this looks now!

In [None]:
df.head()
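
Date parsing is a common place for things to go wrong, so here is a small sketch on made-up strings. The timestamp format shown is an assumption and may not match the real `Dispatch Date / Time` column. `errors="coerce"` turns anything unparseable into `NaT` (a missing datetime) instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately not a date
raw = pd.Series([
    "01/15/2021 08:30:00 PM",
    "07/04/2021 02:05:00 AM",
    "unknown",
])

# An explicit format avoids guessing; errors="coerce" maps bad values to NaT
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
print(parsed)

# Datetime parts are available through the .dt accessor
print(parsed.dt.month_name().tolist())
print(parsed.dt.day_name().tolist())
print(parsed.dt.hour.tolist())
```

After parsing, `parsed.isna().sum()` is a quick way to check how many values failed to convert.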

First, let's again filter out anything in the dataset that is "not a crime."

In [None]:
df_filtered = df[df['Crime Name1'] != 'not a crime']

Now, let's visualize the counts of crimes per hour.

In [None]:
sns.countplot(x='hour', data=df_filtered)
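
If we want the numbers behind a count plot like this one, pandas can compute them directly. Below is a minimal sketch on made-up hours. The `reindex` step is optional and fills in hours that had no incidents.

```python
import pandas as pd

# Hypothetical toy data: the hour of day for six incidents
toy = pd.DataFrame({"hour": [23, 0, 23, 14, 0, 23]})

# Count incidents per hour, ordered by hour rather than by frequency
per_hour = toy["hour"].value_counts().sort_index()
print(per_hour)

# Fill in the hours that never appear so all 24 are present
per_hour_full = per_hour.reindex(range(24), fill_value=0)

# The hour with the most incidents
busiest = per_hour_full.idxmax()
print(busiest)
```

Having the counts as a Series also makes it easy to compare two filtered subsets numerically instead of only by eye.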

In [None]:
df_filtered['Crime Name1'].unique()

Perhaps we want to see only the crimes against property per hour. (Since we lowercased `Crime Name1` earlier, we filter on the lowercase value.)

In [None]:
sns.countplot(x='hour', data=df[df['Crime Name1'] == 'crime against property'])

We don't see too much of a difference...

In [None]:
sns.countplot(x='hour', data=df[df['Crime Name1'] == 'crime against society'])

While there are fewer crimes against society than against property, we see a change in the distribution!

Now let's do crimes against people and give it a nice title so readers know what is happening here! A quick resource on how to add a title [here](https://www.tutorialspoint.com/how-to-add-a-title-on-seaborn-lmplot#).

In [None]:
plot = sns.countplot(x='hour', data=df[df['Crime Name1'] == 'crime against person'])
ax = plt.gca()
ax.set_title("Counts of Crimes Against People per Hour of Day")
plt.show()
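
Since `sns.countplot` returns the Matplotlib axes it draws on, the title and axis labels can also be set on that object directly. Here is a sketch on made-up hours. The label text is only a suggestion, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: the hour of day for six incidents
toy = pd.DataFrame({"hour": [0, 0, 14, 23, 23, 23]})

# countplot returns the Axes object, so no plt.gca() is needed
ax = sns.countplot(x="hour", data=toy)

# ax.set() can apply the title and both axis labels in one call
ax.set(
    title="Counts of Incidents per Hour of Day (toy data)",
    xlabel="Hour of day (0-23)",
    ylabel="Number of incidents",
)
plt.tight_layout()
```

Clear axis labels matter as much as the title: a reader should not have to guess what `hour` or `count` means.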

Notice how intentional we were about our question and the visualization we chose to make for it. And then it is possible for other questions to come up throughout that exploration.

In [None]:
df['Crime Name2'].unique()

In [None]:
df.groupby('Crime Name2').size().plot(kind='bar')

In [None]:
plot = sns.countplot(x='hour', data=df[df['Crime Name2'] == 'Robbery'])