# Reading and Cleaning Data

In [17]:
import pandas as pd

As part of the course you [collect information](../collecting/collecting.md) about different canvases and graffiti found on them throughout the city. Here we will use a sample of the information collected to illustrate how to process this information with Pandas.

You can find these tables in the `/tables` subfolder (directory).

## Read CSV files

Before we load the data, we will use the Python built-in package `pathlib` to generate a path to our data files. This package is very useful when running notebooks on different operating systems, because it writes the path with the correct separator for each one.

In [18]:
from pathlib import Path

# creating a relative path to the data folder 
pth = Path('../../data')
pth

WindowsPath('../../data')

Once a path has been created with `Path`, you can easily append to it using the `/` operator.

In [19]:
# creating a complete filename 
fn = pth / 'canvas.csv'
fn

WindowsPath('../../data/canvas.csv')
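As a minimal sketch of why `pathlib` helps across operating systems, the example below builds the same path in both POSIX and Windows flavours and inspects a few attributes. The folder and file names are placeholders chosen for illustration, not necessarily the ones used in the course repository.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Hypothetical folder and file names, used only for illustration
data_dir = Path("data") / "tables"
csv_file = data_dir / "canvas.csv"

print(csv_file.name)      # file name with extension: canvas.csv
print(csv_file.suffix)    # extension only: .csv
print(csv_file.parent)    # containing folder
print(csv_file.exists())  # True only if the file is really on disk

# The same path written with each platform's separator
posix_style = PurePosixPath("data") / "tables" / "canvas.csv"
windows_style = PureWindowsPath("data") / "tables" / "canvas.csv"
print(posix_style)    # data/tables/canvas.csv
print(windows_style)  # data\tables\canvas.csv
```

Checking `.exists()` before reading a file is a cheap way to catch a wrong relative path early.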

Let's load the 'canvas.csv' dataset as a Pandas DataFrame named *canvas*.

In [20]:
# read canvas csv file into a dataframe 
canvas = pd.read_csv(fn)
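When a CSV file was saved together with its row index, `pd.read_csv` reads that unlabeled first column as a regular column named `Unnamed: 0`. The sketch below, which uses a small made-up CSV string rather than the course data, shows how the `index_col` argument can be used to treat that column as the index instead. Whether this is appropriate for your own files is an assumption you should check against how they were written.

```python
import io

import pandas as pd

# Toy CSV text standing in for a real file; the first column has no header
csv_text = ",id,title\n0,10,Wall\n1,11,Mailbox\n"

# Default behaviour: the unlabeled column becomes 'Unnamed: 0'
default_df = pd.read_csv(io.StringIO(csv_text))

# Using the first column as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))  # ['Unnamed: 0', 'id', 'title']
print(list(indexed_df.columns))  # ['id', 'title']
```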

## Display Data

Before we display the datasets, we are going to limit the maximum number of rows to display to 20. You can change various Pandas [default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to your own liking.

In [22]:
pd.options.display.max_rows = 20
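If you only want to change a display setting for a single output, one option is `pd.option_context`, which restores the previous value when the `with` block ends. The sketch below uses a toy DataFrame, and the value `6` is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy data, only for illustration
toy = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# The setting applies only inside the with block
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(toy)  # truncated display with an ellipsis row

after = pd.get_option("display.max_rows")
print(before == after)  # True: the earlier setting is back in place
```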

In [23]:
canvas

Unnamed: 0,id,created_at,uploaded_at,created_by,title,at_canvas,coords,date_entry_canvas,property_type,property_use,surveillance_status,surveillance,canvas_location,canvas_nature,surface_material,graffiti_removal,viewing_potential,accessibility
0,0,2023-11-27 13:35:31-08:00,2023-11-27 13:40:43-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']
1,1,2023-11-27 13:34:42-08:00,2023-11-27 13:40:40-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']
2,2,2023-11-27 13:33:35-08:00,2023-11-27 13:40:37-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']
3,3,2023-11-27 13:32:35-08:00,2023-11-27 13:40:26-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']
4,4,2023-11-27 13:06:08-08:00,2023-11-27 13:33:15-08:00,kbtellez@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.661615, 'longitude': -122.3204...",11/27/2023,comercial,undetermined,Y,"['cameras', 'lights', 'people', 'alarms']",street,wall,['asphalt'],N,medium,['street_Level']
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,1456,2023-10-03 17:03:18.380000-07:00,2023-10-03 17:24:54-07:00,kp10@uw.edu,10/03/2023 Wall,Y,"{'latitude': 47.66173, 'longitude': -122.31275...",10/03/2023,comercial,in_use,Y,['cameras'],alley,wall,['concrete'],N,low,['street_Level']
1457,1457,2023-10-03 17:00:24.229000-07:00,2023-10-03 17:24:48-07:00,kp10@uw.edu,10/03/2023 Wall,Y,"{'latitude': 47.661539, 'longitude': -122.3124...",10/03/2023,comercial,in_use,Y,['cameras'],alley,wall,['concrete'],Y,low,['street_Level']
1458,1458,2023-10-02 18:15:21.497000-07:00,2023-10-02 18:17:18-07:00,kp10@uw.edu,10/02/2023 Mailbox,Y,"{'latitude': 47.663798, 'longitude': -122.3153...",10/02/2023,residential,in_use,Y,"['lights', 'alarms', 'people']",street,mailbox,['metal'],N,medium,['street_Level']
1459,1459,2023-10-02 10:47:08.557000-07:00,2023-10-02 10:50:14-07:00,kp10@uw.edu,10/02/2023 Other,Y,"{'latitude': 47.663486, 'longitude': -122.3153...",10/02/2023,residential,in_use,Y,"['cameras', 'alarms', 'people']",street,other,['metal'],Y,medium,['street_Level']


If you scroll down or sideways you will encounter an ellipsis (`...`). Pandas inserts it to let you know that there are more rows and columns in between that are not currently displayed.   
If you want to show only the first rows, you can use the `.head()` method. It returns the first five rows by default; here we ask for the first two.

In [24]:
# Show the first two rows
canvas.head(2)

Unnamed: 0,id,created_at,uploaded_at,created_by,title,at_canvas,coords,date_entry_canvas,property_type,property_use,surveillance_status,surveillance,canvas_location,canvas_nature,surface_material,graffiti_removal,viewing_potential,accessibility
0,0,2023-11-27 13:35:31-08:00,2023-11-27 13:40:43-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']
1,1,2023-11-27 13:34:42-08:00,2023-11-27 13:40:40-08:00,jsomer@uw.edu,11/27/2023 Wall,Y,"{'latitude': 47.658577, 'longitude': -122.3176...",11/27/2023,comercial,abandoned,N,[],street,wall,['concrete'],N,medium,['street_Level']


Let's do the same with the 'graffiti.csv' dataset and load it as a Pandas DataFrame named *graffiti*.

In [25]:
# read graffiti csv file into a dataframe
graffiti = pd.read_csv(pth/'graffiti.csv')

In [26]:
graffiti

Unnamed: 0,id,canvas_id,created_at,uploaded_at,created_by,title,num,date_recorded,width,height,...,technique,marker_type,nip_type,other,num_colors,colors,nature_graffiti,transcribable,message,transcription
0,0,3,2023-11-27 13:40:11-08:00,2023-11-27 13:40:35-08:00,jsomer@uw.edu,11/27/2023 “Roja”,1,11/27/2023,91,35,...,spray,,,,2,"['black', 'white']","['Image', 'Text']",Y,writter,“Roja”
1,1,3,2023-11-27 13:39:13-08:00,2023-11-27 13:40:33-08:00,jsomer@uw.edu,11/27/2023 Triangle/prism,1,11/27/2023,91,60,...,spray,,,,3-5,"['black', 'white', 'red', 'gold']",['Image'],Y,other,Triangle/prism
2,2,3,2023-11-27 13:37:39-08:00,2023-11-27 13:40:29-08:00,jsomer@uw.edu,11/27/2023,1,11/27/2023,25,32,...,marker,marker,unknown,,1,['white'],['Text'],N,,
3,3,3,2023-11-27 13:36:56-08:00,2023-11-27 13:40:27-08:00,jsomer@uw.edu,11/27/2023,1,11/27/2023,365,121,...,spray,,,,3-5,"['white', 'black', 'blue']",['Text'],N,,
4,4,4,2023-11-27 13:31:49-08:00,2023-11-27 13:33:51-08:00,kbtellez@uw.edu,11/27/2023 Uhkrew,1,11/27/2023,53,23,...,spray,,,,1,['black'],['Text'],Y,writter,Uhkrew
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3628,3628,1455,2023-10-03 17:06:45.836000-07:00,2023-10-03 17:25:00-07:00,kp10@uw.edu,10/03/2023 SNIDE,1,10/03/2023,120,90,...,spray,,,,3-5,"['blue', 'green', 'white', 'cyan']",['Text'],Y,writter,SNIDE
3629,3629,1456,2023-10-03 17:04:20.503000-07:00,2023-10-03 17:24:55-07:00,kp10@uw.edu,10/03/2023 CARPO (VOA),1,10/03/2023,120,90,...,spray,,,,2,"['black', 'yellow']",['Text'],Y,writter,CARPO (VOA)
3630,3630,1457,2023-10-03 17:01:29.572000-07:00,2023-10-03 17:24:50-07:00,kp10@uw.edu,10/03/2023 CARPO,1,10/03/2023,120,90,...,spray,,,,1,['black'],['Text'],Y,writter,CARPO
3631,3631,1458,2023-10-02 18:16:29.305000-07:00,2023-10-02 18:17:21-07:00,kp10@uw.edu,10/02/2023 MUTE,1,10/02/2023,15,30,...,marker,marker,chiseled,,1,['black'],['Text'],Y,writter,MUTE


This time we will display the last rows using the `.tail()` method. It returns the last five rows by default; here we ask for the last three.

In [27]:
# Show the last three rows
graffiti.tail(3)

Unnamed: 0,id,canvas_id,created_at,uploaded_at,created_by,title,num,date_recorded,width,height,...,technique,marker_type,nip_type,other,num_colors,colors,nature_graffiti,transcribable,message,transcription
3630,3630,1457,2023-10-03 17:01:29.572000-07:00,2023-10-03 17:24:50-07:00,kp10@uw.edu,10/03/2023 CARPO,1,10/03/2023,120,90,...,spray,,,,1,['black'],['Text'],Y,writter,CARPO
3631,3631,1458,2023-10-02 18:16:29.305000-07:00,2023-10-02 18:17:21-07:00,kp10@uw.edu,10/02/2023 MUTE,1,10/02/2023,15,30,...,marker,marker,chiseled,,1,['black'],['Text'],Y,writter,MUTE
3632,3632,1459,2023-10-02 10:49:50.901000-07:00,2023-10-02 18:17:15-07:00,kp10@uw.edu,10/02/2023 BEASTER 2/10,1,10/02/2023,60,30,...,marker,mops,,,1,['black'],['Text'],Y,writter,BEASTER 2/10
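To make the defaults of `.head()` and `.tail()` concrete, here is a small sketch on a toy DataFrame (not the course data). Both methods return five rows when called without an argument, and both return a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Toy data with ten rows, labeled 0 to 9
toy = pd.DataFrame({"value": range(10)})

first_default = toy.head()   # first five rows
last_default = toy.tail()    # last five rows
last_three = toy.tail(3)     # last three rows

print(len(first_default))           # 5
print(list(last_default.index))     # [5, 6, 7, 8, 9]
print(list(last_three["value"]))    # [7, 8, 9]
```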


## Cleaning Data

A very large portion of time with any Data Science project (yes, this includes Archaeology) is spent cleaning data in order to prepare it for later stages like analysis, modeling and visualization.

Cleaning data means processing the data in order to take care of any missing or erroneous data. Missing data is usually easier to spot than erroneous data. Also, you might have erroneous entries in just a few columns but not all of them. Ultimately, it is up to you, the analyst, to decide what data should be eliminated before proceeding to the next stage.

### Missing data

Anytime you read information into a DataFrame, Pandas will flag missing entries using a `NaN` value. For instance, if you check the graffiti DataFrame above you will find this value inserted in various rows in several columns such as *marker_type*, *nip_type*, and so on.
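The sketch below illustrates this behaviour with a small made-up CSV string instead of the course files. With the default settings of `pd.read_csv`, an empty field and the literal text `NA` are both read as `NaN`; the column names here only mimic those of the graffiti table.

```python
import io

import pandas as pd

# Toy CSV text: row 0 has an empty marker_type, row 2 has the text 'NA'
csv_text = (
    "id,technique,marker_type\n"
    "0,spray,\n"
    "1,marker,mops\n"
    "2,spray,NA\n"
)

toy = pd.read_csv(io.StringIO(csv_text))

# Rows 0 and 2 show NaN in the marker_type column
print(toy)
```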

Pandas offers several methods to detect and handle `NaN` entries.

### `isna()` and `notna()`

Both of these methods allow us to identify where we have `NaN` entries. For instance, let's imagine that you are interested in only processing those graffiti entries for which we have *marker_type* information.

In [28]:
graffiti.marker_type.notna()

0       False
1       False
2        True
3       False
4       False
        ...  
3628    False
3629    False
3630    False
3631     True
3632     True
Name: marker_type, Length: 3633, dtype: bool
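As a small illustration on toy data, `.isna()` returns the exact opposite mask of `.notna()`, and because `True` counts as 1, summing a mask gives the number of missing entries. The column names below imitate the graffiti table but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy data with two missing marker_type entries
toy = pd.DataFrame({
    "technique": ["spray", "marker", "spray", "marker"],
    "marker_type": [np.nan, "marker", np.nan, "mops"],
})

missing = toy["marker_type"].isna()
present = toy["marker_type"].notna()

print((missing == ~present).all())  # True: the masks are exact opposites
print(missing.sum())                # 2 missing entries in this column
print(toy.isna().sum())             # missing entries per column
```

Calling `.isna().sum()` on the whole DataFrame is a quick way to see which columns need attention before deciding how to clean them.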

We can use the above to select only those rows that have some entry in the *marker_type* column

In [29]:
graffiti[graffiti.marker_type.notna()]

Unnamed: 0,id,canvas_id,created_at,uploaded_at,created_by,title,num,date_recorded,width,height,...,technique,marker_type,nip_type,other,num_colors,colors,nature_graffiti,transcribable,message,transcription
2,2,3,2023-11-27 13:37:39-08:00,2023-11-27 13:40:29-08:00,jsomer@uw.edu,11/27/2023,1,11/27/2023,25,32,...,marker,marker,unknown,,1,['white'],['Text'],N,,
30,30,6,2023-11-27 10:56:25-08:00,2023-11-27 10:59:34-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,3,3,...,marker,marker,chiseled,,1,['white'],['Text'],N,,
31,31,6,2023-11-27 10:55:52-08:00,2023-11-27 10:59:31-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,4,5,...,marker,marker,chiseled,,1,['white'],['Text'],N,,
32,32,6,2023-11-27 10:55:09-08:00,2023-11-27 10:59:29-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,5,6,...,marker,marker,chiseled,,1,['white'],"['Text', 'Image']",N,,
33,33,6,2023-11-27 10:54:20-08:00,2023-11-27 10:59:28-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,9,16,...,marker,painstick/chalk,,,1,['white'],['Text'],N,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3622,3622,1447,2023-10-05 13:09:01.224000-07:00,2023-10-05 13:09:08-07:00,aimerino@uw.edu,10/05/2023,1,10/05/2023,40,13,...,marker,marker,chiseled,,1,['green'],['Text'],N,,
3623,3623,1448,2023-10-05 13:08:42-07:00,2023-10-05 16:27:03-07:00,tlovett@uw.edu,10/05/2023,1,10/05/2023,2,4,...,marker,marker,chiseled,,1,['pink'],['Text'],N,,
3627,3627,1454,2023-10-03 17:12:36.007000-07:00,2023-10-03 17:25:04-07:00,kp10@uw.edu,10/03/2023 KAYP WOOB,1,10/03/2023,30,30,...,marker,mops,,,1,['yellow'],['Text'],Y,writter,KAYP WOOB
3631,3631,1458,2023-10-02 18:16:29.305000-07:00,2023-10-02 18:17:21-07:00,kp10@uw.edu,10/02/2023 MUTE,1,10/02/2023,15,30,...,marker,marker,chiseled,,1,['black'],['Text'],Y,writter,MUTE
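Note that the selected rows keep their original row labels, which is why the labels in the output above are not consecutive. The sketch below shows the same kind of selection on toy data using `.loc`, which also lets you pick a few columns at the same time; the data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; only rows 1 and 3 have a marker_type
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "technique": ["spray", "marker", "spray", "marker"],
    "marker_type": [np.nan, "marker", np.nan, "mops"],
})

has_marker = toy["marker_type"].notna()

# Keep only rows where the mask is True, and only two of the columns
subset = toy.loc[has_marker, ["id", "marker_type"]]

print(subset)
print(list(subset.index))  # [1, 3]: original row labels are preserved
```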


### `.dropna()`

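Another possibility is the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method, which drops the rows that contain missing values. Its `subset` argument restricts the check to the columns you list, so here only rows with a missing *marker_type* are dropped.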

In [32]:
graffiti.dropna(subset='marker_type')

Unnamed: 0,id,canvas_id,created_at,uploaded_at,created_by,title,num,date_recorded,width,height,...,technique,marker_type,nip_type,other,num_colors,colors,nature_graffiti,transcribable,message,transcription
2,2,3,2023-11-27 13:37:39-08:00,2023-11-27 13:40:29-08:00,jsomer@uw.edu,11/27/2023,1,11/27/2023,25,32,...,marker,marker,unknown,,1,['white'],['Text'],N,,
30,30,6,2023-11-27 10:56:25-08:00,2023-11-27 10:59:34-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,3,3,...,marker,marker,chiseled,,1,['white'],['Text'],N,,
31,31,6,2023-11-27 10:55:52-08:00,2023-11-27 10:59:31-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,4,5,...,marker,marker,chiseled,,1,['white'],['Text'],N,,
32,32,6,2023-11-27 10:55:09-08:00,2023-11-27 10:59:29-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,5,6,...,marker,marker,chiseled,,1,['white'],"['Text', 'Image']",N,,
33,33,6,2023-11-27 10:54:20-08:00,2023-11-27 10:59:28-08:00,jalcon05@uw.edu,11/27/2023,1,11/27/2023,9,16,...,marker,painstick/chalk,,,1,['white'],['Text'],N,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3622,3622,1447,2023-10-05 13:09:01.224000-07:00,2023-10-05 13:09:08-07:00,aimerino@uw.edu,10/05/2023,1,10/05/2023,40,13,...,marker,marker,chiseled,,1,['green'],['Text'],N,,
3623,3623,1448,2023-10-05 13:08:42-07:00,2023-10-05 16:27:03-07:00,tlovett@uw.edu,10/05/2023,1,10/05/2023,2,4,...,marker,marker,chiseled,,1,['pink'],['Text'],N,,
3627,3627,1454,2023-10-03 17:12:36.007000-07:00,2023-10-03 17:25:04-07:00,kp10@uw.edu,10/03/2023 KAYP WOOB,1,10/03/2023,30,30,...,marker,mops,,,1,['yellow'],['Text'],Y,writter,KAYP WOOB
3631,3631,1458,2023-10-02 18:16:29.305000-07:00,2023-10-02 18:17:21-07:00,kp10@uw.edu,10/02/2023 MUTE,1,10/02/2023,15,30,...,marker,marker,chiseled,,1,['black'],['Text'],Y,writter,MUTE


````{caution}
None of the above has **actually** changed our original DataFrame yet. To do this, you need to overwrite the `graffiti` DataFrame with the new 'cleaned' version, e.g.
```python
graffiti = graffiti.dropna(subset='marker_type')
```
````
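The following sketch, on invented toy data, contrasts `.dropna()` with and without `subset` and confirms that the original DataFrame is left untouched until you reassign it.

```python
import numpy as np
import pandas as pd

# Toy data: every row has at least one missing value
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "marker_type": [np.nan, "marker", np.nan],
    "nip_type": [np.nan, np.nan, "chiseled"],
})

# Drop rows only when marker_type is missing
by_column = toy.dropna(subset=["marker_type"])

# Without subset, a row is dropped if ANY column is missing
any_missing = toy.dropna()

print(len(by_column))    # 1
print(len(any_missing))  # 0
print(len(toy))          # 3: the original is unchanged
```

Because `.dropna()` without `subset` can remove far more rows than intended, it is usually worth naming the columns you actually care about.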

### `.fillna()`

In some circumstances you might just want to substitute the missing `NaN` entries with some values, either numerical or text, before you continue with your analysis. For this you can use the Pandas method [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).

In [35]:
graffiti.marker_type.fillna('not recorded')

0       not recorded
1       not recorded
2             marker
3       not recorded
4       not recorded
            ...     
3628    not recorded
3629    not recorded
3630    not recorded
3631          marker
3632            mops
Name: marker_type, Length: 3633, dtype: object

````{caution}
We have not changed the *marker_type* column yet. If you want to do this, you need to overwrite the existing *marker_type* column with this new version as follows:
```python
graffiti['marker_type'] = graffiti.marker_type.fillna('not recorded')
```
````
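When several columns need different replacement values, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values. The sketch below uses toy data, and the choice of `0` for a missing width is only an illustration, not a recommendation for the course dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing entry in each column
toy = pd.DataFrame({
    "marker_type": [np.nan, "marker"],
    "width": [np.nan, 25.0],
})

# One fill value per column; returns a new DataFrame
filled = toy.fillna({"marker_type": "not recorded", "width": 0})

print(filled)
print(toy["marker_type"].isna().sum())  # 1: the original still has its NaN
```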