# Methods

Let's read in all the data before we start.

In [1]:
import pandas as pd

#### Read csv files

In [2]:
from pathlib import Path

# creating a relative path to the data folder 
pth = Path('../../data')

In [3]:
# read canvas csv file into a dataframe 
canvas = pd.read_csv(pth / 'canvas.csv')

In [4]:
# read graffiti csv file into a dataframe
graffiti = pd.read_csv(pth/'graffiti.csv')

#### Display Data

In [5]:
pd.options.display.max_rows = 50

In [6]:
# Show the first two rows
canvas.head(2)

Unnamed: 0,id,created_at,uploaded_at,created_by,title,at_canvas,coords,date_entry_canvas,property_type,property_use,surveillance_status,surveillance,canvas_location,canvas_nature,surface_material,graffiti_removal,viewing_potential,accessibility
0,0,2023-11-27 13:35:31-08:00,2023-11-27 13:40:43-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']
1,1,2023-11-27 13:34:42-08:00,2023-11-27 13:40:40-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']


In [7]:
# Show the first two rows
graffiti.head(2)

Unnamed: 0,id,canvas_id,created_at,uploaded_at,created_by,title,num,date_recorded,width,height,...,technique,marker_type,nip_type,other,num_colors,colors,nature_graffiti,transcribable,message,transcription
0,0,3,2023-11-27 13:40:11-08:00,2023-11-27 13:40:35-08:00,jsomer@uw.edu,11/27/2023 “Roja”,1,11/27/2023,91,35,...,spray,,,,2,"['black', 'white']","['Image', 'Text']",Y,writter,“Roja”
1,1,3,2023-11-27 13:39:13-08:00,2023-11-27 13:40:33-08:00,jsomer@uw.edu,11/27/2023 Triangle/prism,1,11/27/2023,91,60,...,spray,,,,3-5,"['black', 'white', 'red', 'gold']",['Image'],Y,other,Triangle/prism


### Methods

Pandas has many methods that can be applied to an entire DataFrame or to a single Series in order to query it and generate new information.

To illustrate a few of these operations, create a new column representing the area of each graffito recorded.

|Methods|Explanation|
|:-----:|:---------:|
|[`.describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)|Provides a quick numerical summary|
|[`.idxmax()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html)|Return index of first occurrence of the maximum value|
|[`.idxmin()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html)|Return index of first occurrence of the minimum value|
|[`.count()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html)|Number of observations|
|[`.sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)|Sum of all values|
|[`.min()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html)|Minimum value|
|[`.mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)|Mean of values|
|[`.median()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)|Median of values|
|[`.mode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mode.html)|Mode of values|
|[`.max()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)|Maximum value|
|[`.std()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)|Standard Deviation of values|
|[`.quantile()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)|Calculates a given quantile of the values|
|[`.sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)|Returns a random sample of the DataFrame (with or without replacement)|

In [8]:
pd.options.display.float_format = '{:10,.2f}'.format

In [9]:
graffiti['area'] = graffiti.width * graffiti.height

In [10]:
# summary statistics for the area column
graffiti.area.describe()

count       3,633.00
mean        8,186.00
std        58,371.47
min             0.00
25%            80.00
50%           420.00
75%         2,700.00
max     3,000,000.00
Name: area, dtype: float64

In [11]:
# which graffiti have the largest and smallest areas?
graffiti.area.idxmax(), graffiti.area.idxmin()

(2395, 1998)

In [12]:
# which rows have the largest width and height?
graffiti[['width','height']].idxmax()

width     2748
height    2395
dtype: int64

```{caution}
Several entries may share the same minimum or maximum value. `idxmax()` and `idxmin()` return only the index of the first occurrence.
```
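Since `idxmax()` reports only the first match, ties have to be looked up separately. The following is a minimal sketch on a small made-up Series (not the graffiti data) of one way to list every row that shares the maximum value; the name `areas` and its values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: the maximum value (12) occurs twice
areas = pd.Series([4, 12, 7, 12, 4], name="area")

# idxmax() returns only the index label of the first occurrence
first_max = areas.idxmax()

# A boolean mask keeps every row equal to the maximum
all_max = areas[areas == areas.max()].index.to_list()

print(first_max)
print(all_max)
```

The same mask works with `.min()` to find every row that ties for the minimum.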

In [13]:
graffiti.area.count()

np.int64(3633)

In [14]:
graffiti.area.sum()

np.int64(29739738)

In [15]:
graffiti.area.min()

np.int64(0)

In [16]:
graffiti.area.mean()

np.float64(8186.0)

In [17]:
graffiti.area.median()

np.float64(420.0)

In [18]:
graffiti.area.mode()

0    50
Name: area, dtype: int64

In [19]:
graffiti.area.max()

np.int64(3000000)

In [20]:
graffiti.area.std()

np.float64(58371.47231150483)

In [21]:
graffiti.area.quantile([0.05, 0.10, 0.25, 0.5, 0.75, 0.90, 0.95])

0.05        18.00
0.10        30.00
0.25        80.00
0.50       420.00
0.75     2,700.00
0.90    15,000.00
0.95    36,738.40
Name: area, dtype: float64

In [None]:
# how many graffiti entries?
len(graffiti)

Create a sample with a sampling fraction of 0.1 or 10%

In [None]:
len(graffiti.sample(frac= 0.1))

Create a sample of size 100

In [None]:
# Specifying the sample size
len(graffiti.sample(n= 100))

Create a sample of size 100 with replacement

In [None]:
# With replacement
graffiti.sample(n=100, replace= True)
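Each call to `.sample()` draws different rows, so results change from run to run. If a repeatable sample is needed, the `random_state` parameter can be set. The sketch below uses a hypothetical stand-in DataFrame `df` rather than the graffiti data, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame with 1,000 rows
df = pd.DataFrame({"area": range(1000)})

# The same seed gives the same rows on every call
s1 = df.sample(n=100, random_state=42)
s2 = df.sample(n=100, random_state=42)
same_rows = s1.index.equals(s2.index)

# A 10% sampling fraction of 1,000 rows yields 100 rows
frac_sample = df.sample(frac=0.1, random_state=42)

# With replacement, the same row may be drawn more than once
with_repl = df.sample(n=100, replace=True, random_state=42)

print(same_rows, len(frac_sample), len(with_repl))
```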

## General Functions

Pandas has what are called [General Functions](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html) which offer some powerful operations. We shall be seeing more of these later but for now, let's look at two that are very handy.

### `cut()` and `qcut()`

Both of these functions are very useful when trying to create categories out of a continuous variable. The first, [`cut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html), breaks a continuous variable into bins of equal width (or into bins with edges that you specify).

In [24]:
# Create a new series where each observation is in a bin
graf_area = pd.cut(graffiti.area, bins=3)
graf_area[:5]

0    (-3000.0, 1000000.0]
1    (-3000.0, 1000000.0]
2    (-3000.0, 1000000.0]
3    (-3000.0, 1000000.0]
4    (-3000.0, 1000000.0]
Name: area, dtype: category
Categories (3, interval[float64, right]): [(-3000.0, 1000000.0] < (1000000.0, 2000000.0] < (2000000.0, 3000000.0]]

In [25]:
# label the categories
graf_area = pd.cut(graffiti.area, bins= 3, labels=['small', 'med', 'large'])
graf_area[:5]

0    small
1    small
2    small
3    small
4    small
Name: area, dtype: category
Categories (3, object): ['small' < 'med' < 'large']
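With heavily skewed values, equal-width bins tend to place almost every observation in the first bin, as the output above suggests. `cut()` also accepts a list of explicit bin edges. The sketch below uses a small made-up Series, and the edge values are assumptions picked for illustration, not recommended thresholds for the graffiti data.

```python
import pandas as pd

# Hypothetical toy areas spanning several orders of magnitude
areas = pd.Series([0, 50, 420, 2700, 15000, 3000000], name="area")

# Explicit bin edges (illustrative values only)
edges = [0, 100, 10_000, 3_000_000]

# include_lowest=True makes the first interval include its left edge (0)
sized = pd.cut(
    areas,
    bins=edges,
    labels=["small", "med", "large"],
    include_lowest=True,
)

# Count the observations that fall in each bin
counts = sized.value_counts()
print(counts)
```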

[`qcut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html) is similar to `cut()`, but it uses quantiles to segment the continuous variable, so each bin holds roughly the same number of observations.

In [28]:
# Use quartiles to create categories
graf_area = pd.qcut(graffiti.area, q=4)
graf_area[:5]

0    (2700.0, 3000000.0]
1    (2700.0, 3000000.0]
2        (420.0, 2700.0]
3    (2700.0, 3000000.0]
4        (420.0, 2700.0]
Name: area, dtype: category
Categories (4, interval[float64, right]): [(-0.001, 80.0] < (80.0, 420.0] < (420.0, 2700.0] < (2700.0, 3000000.0]]

In [29]:
# Specify percentiles.
graf_area, bins = pd.qcut(graffiti.area, q=[0, 0.05, 0.25, 0.75, 0.95, 1.0], labels=['tiny', 'small', 'average', 'large', 'huge'], retbins=True)
graf_area

0         large
1         large
2       average
3          huge
4       average
         ...   
3628      large
3629      large
3630      large
3631    average
3632    average
Name: area, Length: 3633, dtype: category
Categories (5, object): ['tiny' < 'small' < 'average' < 'large' < 'huge']

In [30]:
# show cut off values
bins

array([0.00000e+00, 1.80000e+01, 8.00000e+01, 2.70000e+03, 3.67384e+04,
       3.00000e+06])
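The edges returned by `retbins=True` can be reused with `cut()` to place other values into the same categories. The following sketch assumes a small made-up Series and hypothetical labels `q1` to `q4`; it is one possible way to reuse the edges, not the only one.

```python
import pandas as pd

# Hypothetical toy data
areas = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="area")
labels = ["q1", "q2", "q3", "q4"]

# Quartile bins plus the numeric edges that define them
quartiles, edges = pd.qcut(areas, q=4, labels=labels, retbins=True)

# Each quartile holds the same number of observations here
per_bin = quartiles.value_counts()

# Reuse the edges to categorize new values with cut();
# values outside the edges would become NaN
new_values = pd.Series([2, 5, 8])
binned = pd.cut(new_values, bins=edges, labels=labels, include_lowest=True)

print(per_bin)
print(binned)
```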

In [None]:
graffiti.area.value_counts()

## Using Extensions

Whenever we have a field that holds text ([`str`](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#behavior-differences)), a category ([`cat`](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#working-with-categories)), or a datetime ([`dt`](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dt-accessor)), `Pandas` gives us access to specialized operations through an accessor written in **dot** notation, for example `.cat.categories`.

In [31]:
# what are the categories?
graf_area.cat.categories

Index(['tiny', 'small', 'average', 'large', 'huge'], dtype='object')

In [32]:
# what codes?
graf_area.cat.codes

0       3
1       3
2       2
3       4
4       2
       ..
3628    3
3629    3
3630    3
3631    2
3632    2
Length: 3633, dtype: int8

In [33]:
# are they ordered?
graf_area.cat.ordered

True

In [34]:
# convert to string and capitalize
new_categories = graf_area.cat.categories.astype('string').str.capitalize().to_list()
new_categories

['Tiny', 'Small', 'Average', 'Large', 'Huge']

In [35]:
# Change categories
graf_area = graf_area.cat.set_categories(new_categories, rename=True)

In [36]:
# Show categories
graf_area.cat.categories

Index(['Tiny', 'Small', 'Average', 'Large', 'Huge'], dtype='object')
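The `.str` and `.dt` accessors follow the same pattern as `.cat`. Below is a minimal sketch on made-up values modeled on the `title` and `created_at` columns shown earlier; the Series names and contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical text values modeled on the title column
titles = pd.Series(["11/27/2023 Wall", "11/27/2023 Triangle/prism"])

# .str exposes vectorized string operations
upper_titles = titles.str.upper()
has_wall = titles.str.contains("Wall")

# Hypothetical timestamps modeled on the created_at column
created = pd.to_datetime(
    pd.Series(["2023-11-27 13:35:31-08:00", "2023-11-27 13:34:42-08:00"])
)

# .dt exposes datetime components
years = created.dt.year
minutes = created.dt.minute

print(upper_titles.to_list())
print(has_wall.to_list())
print(years.to_list(), minutes.to_list())
```

Columns read with `pd.read_csv()` arrive as plain strings unless they are parsed, so a conversion such as `pd.to_datetime()` is needed before `.dt` can be used.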